여러 행의 데이터를 공통 필드로 병합

여러 행의 데이터를 공통 필드로 병합

세미콜론(;)으로 구분된 9개 이상의 데이터 행이 있는 대용량 파일이 있는데, 열 3(으로 구분)의 데이터가 열 5의 데이터와 일치하는 행을 병합하려고 합니다. 데이터는 Linux 상자에 저장되어 있으며 일반적인 awk/perl 도구가 있지만 사용 방법을 모릅니다.

문서:

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
orion.uk.localhost.com;XY01123;Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

예상 출력:

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

병합 방법에 대한 아이디어를 주시면 감사하겠습니다.

답변1

더 우아한 방법이 있을 수 있지만 awk여기에 가능한 스크립트가 있습니다.

BEGIN { FS=";" ; OFS=";" }
NR==1 { print $0 }
NR>1 {
    if ( b[$5]=="" ) {
        a[$5]=$0
        b[$5]=$3
    }
    else {
        b[$5]=b[$5]","$3
        $3=b[$5]
        a[$5]=$0
    }
}
END {
    for (c in a) {
        print a[c]
    }
}

설명하다:

  • BEGIN입력 및 출력 필드 구분 기호로 세미콜론 설정
  • NR==1첫 번째 줄(헤더)만 인쇄하면 됩니다. 아무 것도 할 필요가 없습니다.
  • NR>1다른 노선의 경우:
    • b[$5]필드 3 항목의 (점점 늘어나는) 쉼표로 구분된 목록을 포함하는 필드 5 값으로 색인화된 배열입니다.
    • a[$5]수정된 행을 포함하는(즉, 쉼표로 구분된 필드 3 값 포함) 필드 5 값으로 색인화된 배열입니다.
    • 설정되지 않은 경우 b[$5](값이 처음 발생) a[$5]행 및 b[$5]필드 3 으로 설정됩니다.
    • 그렇지 않은 경우( b[$5]이미 설정됨) 쉼표 구분 기호가 있는 필드 3을 추가하고 b[$5]이 행의 필드 3을 이것으로 바꾼 다음 a[$5]변경된 행을 다음으로 바꿉니다.
    • ENDc배열의 모든 인덱스 값에 대해 a배열 요소(즉, 원하는 행)를 인쇄합니다.

출력을 정렬 하는 방법을 잘 모르겠지만 awk결과는 다음과 같습니다.

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External

답변2

거기 있어요 sqlite? 라인에 합류하는 방법에 대한 나의 이해가 정확합니까?

sqlite> .separator ;
sqlite> .import file.txt alldata
sqlite> select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";
dev.matrix.localhost.com;Machine-apache-dev1,Machine-apache-dev2
per.Upgrade.uk.localhost.com;Machine-apache-pf01,Machine-apache-pf02
test.matrix.localhost.com;Machine-apache-bcp1,Machine-apache-prd1
uat.matrix.localhost.com;Machine-apache-uat1,Machine-apache-uat2
uat.orion.uk.localhost.com;Machine-apache-ua01,Machine-apache-ua02

또는 비대화형:

echo 'select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";' \
  | sqlite3 -separator ";" -cmd ".import file.txt alldata" -batch

답변3

Perl에서 배열 해싱 사용(스플라이스를 사용하여 각 병합 후 집계 필드를 제거하고 다시 삽입):

$ perl -F\; -alne '

  if($.==1) {
    print;
    next;
  }

  if(!exists $HoA{$F[4]}) {
    $HoA{$F[4]} = [ @F ];
  }
  else {
    splice @{ $HoA{$F[4]} }, 2, 0, join ",", (splice @{ $HoA{$F[4]} }, 2, 1), $F[2];
  }

  END {
    for $k (keys %HoA) {
      print join ";", @{ $HoA{$k} };
    }
  }
  ' data
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

또는 GNU를 사용하여 ( 중복 필드를 제거하려면 datamasha 사용 ):cutgroupby

$ datamash -Hst\; groupby 5 unique 1-2 collapse 3 unique 4-9 < data | cut -d\; -f2-
unique(Domain Name);unique(ID);collapse(Machine);unique(Environment);unique(ENV URL);unique(Start Date);unique(End Date);unique(Disk Size);unique(Used)
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00,22 March 2013 06:16:10 GMT;16 April 2018 07:36:33 GMT+01:00,22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

관련 정보