세미콜론(;)으로 구분된 9개 이상의 데이터 행이 있는 대용량 파일이 있는데, 열 3(으로 구분)의 데이터가 열 5의 데이터와 일치하는 행을 병합하려고 합니다. 데이터는 Linux 상자에 저장되어 있으며 일반적인 awk/perl 도구가 있지만 사용 방법을 모릅니다.
문서:
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
orion.uk.localhost.com;XY01123;Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
예상 출력:
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
병합 방법에 대한 아이디어를 주시면 감사하겠습니다.
답변1
더 우아한 방법이 있을 수 있지만 awk
여기에 가능한 스크립트가 있습니다.
BEGIN { FS=";" ; OFS=";" }
NR==1 { print $0 }
NR>1 {
if ( b[$5]=="" ) {
a[$5]=$0
b[$5]=$3
}
else {
b[$5]=b[$5]","$3
$3=b[$5]
a[$5]=$0
}
}
END {
for (c in a) {
print a[c]
}
}
설명하다:
BEGIN
입력 및 출력 필드 구분 기호로 세미콜론 설정NR==1
첫 번째 줄(헤더)만 인쇄하면 됩니다. 아무 것도 할 필요가 없습니다.NR>1
다른 노선의 경우:b[$5]
필드 3 항목의 (점점 늘어나는) 쉼표로 구분된 목록을 포함하는 필드 5 값으로 색인화된 배열입니다.a[$5]
수정된 행을 포함하는(즉, 쉼표로 구분된 필드 3 값 포함) 필드 5 값으로 색인화된 배열입니다.- 설정되지 않은 경우
b[$5]
(값이 처음 발생)a[$5]
행 및b[$5]
필드 3 으로 설정됩니다. - 그렇지 않은 경우(
b[$5]
이미 설정됨) 쉼표 구분 기호가 있는 필드 3을 추가하고b[$5]
이 행의 필드 3을 이것으로 바꾼 다음a[$5]
변경된 행을 다음으로 바꿉니다. END
c
배열의 모든 인덱스 값에 대해a
배열 요소(즉, 원하는 행)를 인쇄합니다.
출력을 정렬 하는 방법을 잘 모르겠지만 awk
결과는 다음과 같습니다.
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
답변2
거기 있어요 sqlite
? 라인에 합류하는 방법에 대한 나의 이해가 정확합니까?
sqlite> .separator ;
sqlite> .import file.txt alldata
sqlite> select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";
dev.matrix.localhost.com;Machine-apache-dev1,Machine-apache-dev2
per.Upgrade.uk.localhost.com;Machine-apache-pf01,Machine-apache-pf02
test.matrix.localhost.com;Machine-apache-bcp1,Machine-apache-prd1
uat.matrix.localhost.com;Machine-apache-uat1,Machine-apache-uat2
uat.orion.uk.localhost.com;Machine-apache-ua01,Machine-apache-ua02
또는 비대화형:
echo 'select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";' \
| sqlite3 -separator ";" -cmd ".import file.txt alldata" -batch
답변3
Perl에서 배열 해싱 사용(스플라이스를 사용하여 각 병합 후 집계 필드를 제거하고 다시 삽입):
$ perl -F\; -alne '
if($.==1) {
print;
next;
}
if(!exists $HoA{$F[4]}) {
$HoA{$F[4]} = [ @F ];
}
else {
splice @{ $HoA{$F[4]} }, 2, 0, join ",", (splice @{ $HoA{$F[4]} }, 2, 1), $F[2];
}
END {
for $k (keys %HoA) {
print join ";", @{ $HoA{$k} };
}
}
' data
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
또는 GNU를 사용하여 ( 중복 필드를 제거하려면 datamash
a 사용 ):cut
groupby
$ datamash -Hst\; groupby 5 unique 1-2 collapse 3 unique 4-9 < data | cut -d\; -f2-
unique(Domain Name);unique(ID);collapse(Machine);unique(Environment);unique(ENV URL);unique(Start Date);unique(End Date);unique(Disk Size);unique(Used)
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00,22 March 2013 06:16:10 GMT;16 April 2018 07:36:33 GMT+01:00,22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal