Finding lines common to 6 files

Question 1

With awk you can do the following:

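# Skip a line that has already been seen in the current file.
# index() compares the filename literally; a regex match (~) would
# also treat "file1" as seen once "file10" is in the list.
index(seenin[$0] " ", " " FILENAME " ") { next }

# Append the filename to the list of files containing this line
# and increase the per-line file counter.
{ seenin[$0] = seenin[$0] " " FILENAME; nseen[$0]++ }

# Print every line that occurs in more than one file.
END {
    for (line in nseen)
        if (nseen[line] > 1)
            printf "line \"%s\" seen in %d files: %s\n", line, nseen[line], seenin[line]
}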

Limitation: memory, since every distinct line is kept in RAM.

To sort by the number of occurrences, the print step has to be adapted accordingly (e.g. sorted by the values of nseen). That is easy with gawk: add the following to the END block, before the for loop.

PROCINFO["sorted_in"]="@val_num_desc"

Input files:

$ cat file1
a
a
b
b
c
d
e

$ cat file2
c
c
x
z
e
y
z
f

$ cat file3
f
i
a
c
z
i
k

Output (with gawk's array traversal ordering via PROCINFO):

$ awk -f compare_lines_multifiles.awk file1 file2 file3
line "c" seen in 3 files:  file1 file2 file3
line "z" seen in 2 files:  file2 file3
line "a" seen in 2 files:  file1 file3
line "e" seen in 2 files:  file1 file2
line "f" seen in 2 files:  file2 file3

Edit:

The files you provided are in MSDOS format. Either convert them with

 dos2unix file1.txt file2.txt ....

or adjust the record separator in awk. Add the following as the first item in the code:

 BEGIN { RS="\r\n" }
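
If dos2unix is not installed, stripping the carriage returns with tr should work just as well for this purpose. A minimal sketch, assuming the files contain no carriage returns that need to be kept; the function name strip_cr and the demo file are made up for illustration:

```shell
#!/bin/sh
# strip_cr FILE: write FILE to stdout with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1"
}

# Demo on a throwaway file with MSDOS line endings.
tmpdir=$(mktemp -d)
printf 'a\r\nb\r\n' > "$tmpdir/dos.txt"
strip_cr "$tmpdir/dos.txt" > "$tmpdir/unix.txt"
```

The converted copy can then be passed to the awk script without changing RS.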

Edit 2: The files have irregular delimiters. The problem is that a<tab>b and a<tab>b<tab> are treated as different lines, although you would probably consider them identical.

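In your special case, with two fields of interest per file, it is better to compare the contents of the two fields rather than the whole line. The MSDOS format is taken into account as well: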

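# MSDOS line endings; a multi-character RS is an extension supported by gawk and mawk.
BEGIN { RS = "\r\n" }

# Compare only the first two fields, joined by a tab.
{ key = $1 "\t" $2 }

# Skip a key that has already been seen in the current file
# (literal filename comparison instead of a regex match).
index(seenin[key] " ", " " FILENAME " ") { next }

# Append the filename to the list of files containing this key
# and increase the per-key file counter.
{ seenin[key] = seenin[key] " " FILENAME; nseen[key]++ }

# Print every key that occurs in more than one file.
END {
    for (line in nseen)
        if (nseen[line] > 1)
            printf "line \"%s\" seen in %d files: %s\n", line, nseen[line], seenin[line]
}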

In the end, more duplicates show up across all 6 files. The script focuses on the two tab-delimited fields and prints one line of output for each.

Answer

Question 2

I would suggest a different approach: sort all the files together and use uniq -c to count how many times each line is seen.

sort 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt | uniq -c

This prints each line once, together with the number of times it was seen. For example, suppose you have these three files:

$ cat file1 
dog
cat
bird

$ cat file2
fly
bird
moose

$ cat file3
bird
dog
flea

This produces the following output:

$ sort file1 file2 file3 | uniq -c
      3 bird
      1 cat
      2 dog
      1 flea
      1 fly
      1 moose

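So, to separate the lines by the number of times they were found, filter on the count. The first command below shows only the lines that appear in all 3 (or, in your case, 6) files: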

$ sort file1 file2 file3 | uniq -c | awk '$1==3'
      3 bird
$ sort file1 file2 file3 | uniq -c | awk '$1==2'
      2 dog
$ sort file1 file2 file3 | uniq -c | awk '$1==1'
      1 cat
      1 flea
      1 fly
      1 moose
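
One caveat: uniq -c counts occurrences, not files, so a line repeated inside a single file is counted more than once. Deduplicating each file before merging should avoid that. A sketch, with count_files as a hypothetical helper name and throwaway demo files:

```shell
#!/bin/sh
# count_files FILE...: for each distinct line, print the number of files
# that contain it, followed by the line itself.
count_files() {
    for f in "$@"; do
        sort -u "$f"        # each line at most once per file
    done | sort | uniq -c
}

# Demo: "dog" appears twice in f1 but should still count as 2 files, not 3.
tmpdir=$(mktemp -d)
printf 'dog\ndog\ncat\n' > "$tmpdir/f1"
printf 'dog\nbird\n'     > "$tmpdir/f2"

# Keep the lines found in both files and strip the leading count.
common=$(count_files "$tmpdir/f1" "$tmpdir/f2" |
    awk '$1 == 2 { sub(/^ *[0-9]+ /, ""); print }')
echo "$common"
```

With six input files the filter becomes $1 == 6.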

Answer

Question 3

Your first attempt is the right approach:

comm -12 2.txt 3.txt | comm -12 - 4.txt | comm -12 - 5.txt | comm -12 - 6.txt | comm -12 - 7.txt

This works like a stream, with the stages of the pipeline running in parallel. In principle, files with millions of lines can be processed this way.

The problems you ran into with comm(1) seem to be caused by the input, namely whitespace and line endings. Clean those up first and you will find that your original method is fast and convenient.

Here is an example that demonstrates it: find the numbers divisible by each of a set of primes.

$ for D in 2 3 5 7 11 13 
> do seq 1 1000 | 
> awk -v D=$D '$0 % D == 0 { print $0 }' | 
> sort > $D
> done

$ comm -12 2 3 | comm -12 - 5 | comm -12 - 7 
210
420
630
840

It turns out that no number between 1 and 1000 is divisible by all of 2, 3, 5, 7, and 11.
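
The input cleanup mentioned earlier could look roughly like this: each file is stripped of carriage returns and trailing whitespace, then sorted, before it is handed to comm. This is only a sketch; the helper name clean and the demo files are hypothetical, and LC_ALL=C is set so that sort and comm agree on the ordering:

```shell
#!/bin/sh
export LC_ALL=C

# clean FILE: drop carriage returns and trailing whitespace,
# then sort and deduplicate so the result is valid input for comm.
clean() {
    tr -d '\r' < "$1" | sed 's/[[:space:]]*$//' | sort -u
}

# Demo: x.txt has MSDOS line endings and a trailing tab, y.txt does not.
tmpdir=$(mktemp -d)
printf 'a\tb\t\r\nc\td\r\n' > "$tmpdir/x.txt"
printf 'a\tb\nq\tr\n'       > "$tmpdir/y.txt"

clean "$tmpdir/x.txt" > "$tmpdir/x.clean"
clean "$tmpdir/y.txt" > "$tmpdir/y.clean"

# Lines common to both cleaned files.
common=$(comm -12 "$tmpdir/x.clean" "$tmpdir/y.clean")
```

The cleaned files can then be chained through comm -12 exactly as in the original pipeline.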

Answer