Finding the intersection of lines in two files

Answer 1

A simple comm + sort solution:

comm -12 <(sort file1) <(sort file2)

-12 suppresses columns 1 and 2 (the lines unique to FILE1 and FILE2, respectively), so only the common lines (those that occur in both files) are printed.
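
As a minimal sketch of the same idea without process substitution, the snippet below sorts two made-up sample files (a.txt and b.txt are hypothetical names) and compares them. It assumes that sort and comm must run under the same locale, because comm expects its inputs in the collation order it uses itself; LC_ALL=C is one way to guarantee that.

```shell
# Sketch: intersect two small sample files with comm.
# a.txt and b.txt are made-up inputs used only for illustration.
tmp=$(mktemp -d)
printf '%s\n' 67 102 5 > "$tmp/a.txt"
printf '%s\n' 102 9 67 > "$tmp/b.txt"

# Sort and compare under the same locale so comm sees the order it expects.
LC_ALL=C sort "$tmp/a.txt" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/b.txt" > "$tmp/b.sorted"

# -12 hides column 1 (only in a) and column 2 (only in b).
common=$(LC_ALL=C comm -12 "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$common"

rm -rf "$tmp"
```

The result comes out in sorted order (here byte order), not in the order of either input file.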

Answer 2

In awk, this loads the first file fully into memory:

$ awk 'NR==FNR { lines[$0]=1; next } $0 in lines' file1 file2 
67
102
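
To make the NR==FNR idiom concrete, here is a small sketch on made-up data. The function name intersect_keep_order and the sample contents are assumptions for illustration only. NR==FNR is true only while awk reads the first file, so those lines become array keys; every later line is printed if it is one of those keys.

```shell
# Sketch: print the lines of the second file that also occur in the first,
# keeping the order (and the repeats) of the second file.
intersect_keep_order() {
    # $1: file loaded into memory, $2: file that is filtered
    awk 'NR==FNR { seen[$0] = 1; next } $0 in seen' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 102 5 67 > "$tmp/file1"
printf '%s\n' 67 9 102 67 > "$tmp/file2"

result=$(intersect_keep_order "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Note that 67 is printed twice because it occurs twice in the second file; memory use grows with the number of distinct lines in the first file.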

Or, if you want to keep track of how many times a given line occurs:

$ awk 'NR==FNR { lines[$0] += 1; next } lines[$0] {print; lines[$0] -= 1}' file1 file2
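
The counting variant behaves like a multiset intersection: a line is printed at most as many times as it occurs in the first file. The sketch below illustrates this on made-up data; intersect_with_counts is a hypothetical wrapper name.

```shell
# Sketch: each occurrence in the first file "pays for" one printed occurrence.
intersect_with_counts() {
    awk 'NR==FNR { lines[$0] += 1; next }
         lines[$0] { print; lines[$0] -= 1 }' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' a a b > "$tmp/file1"      # "a" occurs twice
printf '%s\n' a a a b c > "$tmp/file2"  # "a" occurs three times

counted=$(intersect_with_counts "$tmp/file1" "$tmp/file2")
printf '%s\n' "$counted"

rm -rf "$tmp"
```

Here the third "a" in the second file is dropped because the first file contains only two.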

join can do this too, but it requires sorted input, so you have to sort the files first, and doing so loses the original ordering:

$ join <(sort file1) <(sort file2)
102
67
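
One caveat worth illustrating: by default join matches on the first whitespace-separated field, not on the whole line, so it only acts as a line intersection when each line is a single field. The sketch below uses made-up two-field inputs (f1 and f2 are hypothetical names) that are already sorted.

```shell
# Sketch: join pairs lines by their first field and merges the rest.
tmp=$(mktemp -d)
printf '%s\n' 'k1 left' 'k2 left' > "$tmp/f1"
printf '%s\n' 'k1 right' 'k3 right' > "$tmp/f2"

# Both inputs are already sorted on the join field.
joined=$(LC_ALL=C join "$tmp/f1" "$tmp/f2")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Only k1 appears in both files, and the output line combines the remaining fields from each side.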

Answer 3

awk

awk 'NR==FNR { p[NR]=$0; next; }
   { for(val in p) if($0==p[val]) { delete p[val]; print; } }' file1 file2

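This is a good solution because, for large files, it should be the fastest: it avoids both printing the same entry more than once and re-checking an entry after it has been matched.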
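
For comparison, the sketch below runs the loop-based command next to a variant that uses the lines themselves as array keys, so each lookup is a hash access instead of a scan over all stored entries. The keyed variant is a swapped-in alternative, not the answer's own code, and the two are only expected to agree when the first file contains no duplicate lines, as in this made-up sample.

```shell
tmp=$(mktemp -d)
printf '%s\n' x y > "$tmp/file1"
printf '%s\n' y y x > "$tmp/file2"

# The loop-based form from the answer: scan every stored entry per input line.
loop_out=$(awk 'NR==FNR { p[NR]=$0; next }
    { for (val in p) if ($0 == p[val]) { delete p[val]; print } }' \
    "$tmp/file1" "$tmp/file2")

# Alternative (assumption): key the array by line content instead.
hash_out=$(awk 'NR==FNR { p[$0]; next }
    $0 in p { delete p[$0]; print }' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$loop_out"
rm -rf "$tmp"
```

In both forms the second "y" is not printed, because the matching entry was deleted after the first hit.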

grep

grep -Fxf file1 file2

This prints the same entry multiple times if it occurs more than once in file2.
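
A short sketch of what the three options do, on made-up data: -f reads the patterns from the first file, -F treats them as fixed strings rather than regular expressions, and -x requires a pattern to match the whole line.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a.b' > "$tmp/file1"
printf '%s\n' 'a.b' 'axb' 'a.b' 'a.bc' > "$tmp/file2"

# Without -F the dot would also match "axb"; without -x "a.bc" would match.
matches=$(grep -Fxf "$tmp/file1" "$tmp/file2")
printf '%s\n' "$matches"

rm -rf "$tmp"
```

The line a.b is printed twice because it occurs twice in the second file; piping the output through sort -u is one way to remove such repeats if order does not matter.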

sort

For fun (should be much slower than grep):

sort -u file1 >t1
sort -u file2 >t2
sort t1 t2 | uniq -d
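
The same pipeline can be wrapped so that it does not leave t1 and t2 behind in the current directory. This is a sketch under assumptions: sort_uniq_intersect is a hypothetical name and the temporary directory comes from mktemp. After sort -u each line occurs at most once per file, so any line that uniq -d reports as repeated must be present in both.

```shell
sort_uniq_intersect() {
    # $1 and $2: the two files to intersect
    dir=$(mktemp -d)
    LC_ALL=C sort -u "$1" > "$dir/t1"
    LC_ALL=C sort -u "$2" > "$dir/t2"
    # A line seen twice in the merged stream occurs in both inputs.
    LC_ALL=C sort "$dir/t1" "$dir/t2" | uniq -d
    rm -rf "$dir"
}

tmp=$(mktemp -d)
printf '%s\n' b a a > "$tmp/file1"
printf '%s\n' a c b b > "$tmp/file2"

dups=$(sort_uniq_intersect "$tmp/file1" "$tmp/file2")
printf '%s\n' "$dups"

rm -rf "$tmp"
```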

Answer 4

A slightly different awk version and an equivalent perl version

Time reported for three consecutive runs:

$ # just realized shuf -n2000000 -i1-2352452 can be used too ;)
$ shuf -i1-2352452 | head -n2000000 > f1
$ shuf -i1-2352452 | head -n2000000 > f2

$ time awk 'NR==FNR{a[$1]; next} $0 in a' f1 f2 > t1
real    0m3.322s
real    0m3.094s
real    0m3.029s

$ time awk 'BEGIN{while( (getline k < "f1")>0 ){a[k]}} $0 in a' f2 > t2
real    0m2.731s
real    0m2.777s
real    0m2.801s

$ time perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <f1 f2 > t3
real    0m2.643s
real    0m2.690s
real    0m2.630s

$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical

$ du -h f1 f2 t1
15M f1
15M f2
13M t1
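
Before timing on large inputs, the two awk forms can be checked against each other on a small data set. The sketch below is an illustration only: the seq ranges and the src variable name are made up, and it makes no claim about the timings above. The getline form reads the first file inside BEGIN, so the main rule no longer has to evaluate NR==FNR for every line.

```shell
tmp=$(mktemp -d)
seq 1 1000 > "$tmp/f1"
seq 500 1500 > "$tmp/f2"

# Form 1: the first file is recognised by NR==FNR.
awk 'NR==FNR { a[$0]; next } $0 in a' "$tmp/f1" "$tmp/f2" > "$tmp/t1"

# Form 2: the first file is read with getline in BEGIN (path passed via -v).
awk -v src="$tmp/f1" 'BEGIN { while ((getline k < src) > 0) a[k] } $0 in a' \
    "$tmp/f2" > "$tmp/t2"

# Compare the outputs and count the common lines (500 through 1000).
if cmp -s "$tmp/t1" "$tmp/t2"; then same=yes; else same=no; fi
count=$(awk 'END { print NR }' "$tmp/t1")
printf 'identical=%s lines=%s\n' "$same" "$count"

rm -rf "$tmp"
```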

