Compare file B with file A and extract data from A using awk, sed, or grep

Answer 1

You can use awk:

awk 'NR==FNR{                  # While reading the first file (fileB),
       a[$0];                  # store each line as a key of the array a
       next
     }
     {                         # While reading the second file (fileA),
         for(i in a)           # for each key of the array a,
            if(index($0,i)) {  # check whether it occurs in the current record
               print "C" $0    # if so, print the record, restoring the "C" that RS consumed
               next
            }
     }' fileB RS='\nC' fileA
C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein 
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD 
D      NT05HA_1310 protein-export membrane protein SecF 
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein
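
The command above relies on a multi-character RS, which GNU awk and mawk treat as a regular expression but which POSIX leaves unspecified. As a rough sketch of the same substring test without that dependency, the line-oriented variant below keeps a flag that is recomputed on every C line. It swaps record splitting for a per-line flag, so it is not the answer's method. The sample data is partly made up: the third record is a placeholder, the `*` emphasis markers are left out, and the contents of fileB are assumed to be the two pathway names.

```shell
#!/bin/sh
# Made-up sample data: the third record is a placeholder that must NOT be selected.
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
C    00000 Example pathway [PATH:xxx00000]
D      GENE_0001 example protein
EOF
printf '%s\n' 'Bacterial chemotaxis' 'Bacterial secretion system' > "$tmp/fileB"

out=$(awk '
  NR == FNR { if ($0 != "") names[$0]; next }  # first file: remember every non-empty name
  /^C/ {                                       # a C line opens a new record
    keep = 0
    for (n in names)
      if (index($0, n)) { keep = 1; break }    # plain substring test, no regex
  }
  keep                                         # print every line of a selected record
' "$tmp/fileB" "$tmp/fileA")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample, only the two Bacterial records and their D lines are printed. The sketch assumes fileA contains nothing but C and D lines; any other line type that follows a selected record would be printed along with it.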

Answer 2

If you want an exact match on the part between C <word> and [PATH:...] (and assuming the * characters in your sample are only there for emphasis and are not part of the actual data), you could do:

awk '
  !start {all_strings[$0]; next}
  /^C/ {
    key = $0

    # strip the leading C <word>:
    sub(/^C[[:blank:]]+[^[:blank:]]+[[:blank:]]*/, "", key)

    # strip the trailing [...]:
    sub(/[[:blank:]]*\[[^]]*][[:blank:]]*$/, "", key)
    selected = key in all_strings
  }
  selected' fileB start=1 fileA

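Besides being more reliable (for example, Bacterial secretion matches only a record named exactly Bacterial secretion, not Bacterial secretion system), this is also very efficient: each file is read only once, and the matching is a hash table lookup rather than many substring searches or regular expression matches.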
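
To make the difference between exact and substring matching concrete, here is a small self-contained run under some assumptions. The sample files are cut down from the output shown in this thread, without the `*` markers. The array is renamed `wanted`, and `[ \t]` stands in for `[[:blank:]]` only so that the sketch also runs on older awk builds lacking POSIX character classes. fileB deliberately contains Bacterial secretion, which is only a prefix of a real pathway name.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
EOF
# "Bacterial secretion" is only a prefix of a real name, so it must select nothing.
printf '%s\n' 'Bacterial chemotaxis' 'Bacterial secretion' > "$tmp/fileB"

exact=$(awk '
  !start { wanted[$0]; next }                  # fileB: every line becomes a hash key
  /^C/ {
    key = $0
    sub(/^C[ \t]+[^ \t]+[ \t]*/, "", key)      # drop the leading "C <word>"
    sub(/[ \t]*\[[^]]*\][ \t]*$/, "", key)     # drop the trailing "[...]"
    selected = (key in wanted)                 # exact lookup, no substring search
  }
  selected
' "$tmp/fileB" start=1 "$tmp/fileA")

printf '%s\n' "$exact"
rm -rf "$tmp"
```

Only the Bacterial chemotaxis record is printed. A substring test with index() would also have selected Bacterial secretion system, because the shorter name occurs inside it.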

Answer 3

I'm sure I'll get shot down for using a loop, but still... here is one approach:

#!/bin/bash

while read -r line; do
    sed -n "/$line/,/^C/p" fileA | sed '$d'
done < fileB

Example:

./bacteria.sh 
C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein 
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD 
D      NT05HA_1310 protein-export membrane protein SecF 
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein

Here, fileA and fileB are your sample files.

Regex breakdown:

sed -n "/$line/,/^C/p" fileA | sed '$d'

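This prints the lines from the one matching $line up to the next line that starts with the letter C. The last line is then dropped (sed '$d') because it only serves as a "stop marker".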
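
One edge case follows from that description: when the matched record is the last one in fileA there is no stop marker, so `sed '$d'` removes a real D line instead. The sketch below reproduces this on a cut-down sample and shows one possible guard, which deletes the final line only if it starts with C. The guard is a suggestion rather than part of the original script, and it assumes every record has at least one D line.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
D      NT05HA_1310 protein-export membrane protein SecF
EOF
line='Bacterial secretion system'   # the last record in this sample

# Original pipeline: at end of file there is no stop marker, so a D line is lost.
plain=$(sed -n "/$line/,/^C/p" "$tmp/fileA" | sed '$d')

# Variant: drop the final line only when it really is a C stop marker.
guarded=$(sed -n "/$line/,/^C/p" "$tmp/fileA" | sed '${/^C/d;}')

printf 'plain:\n%s\n\nguarded:\n%s\n' "$plain" "$guarded"
rm -rf "$tmp"
```

Here `plain` ends at the SecD line, while `guarded` also keeps the SecF line. Both forms still interpolate `$line` into a regular expression, so names containing characters such as `/`, `.` or `[` would need escaping first.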

sed --version
sed (GNU sed) 4.2.2

bash --version
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)

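Answer 4

The data in fileA is divided into records, each starting with a C at the beginning of a line. Each record is in turn divided into fields, each starting with a D at the beginning of a line.

We need to read the lines of fileB and use them to query the first field of each record in fileA.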

while read -r query; do
    awk -vq="$query" 'BEGIN { RS="^C|\nC"; FS=OFS="\nD" } $1 ~ q {print "C" $0}' fileA
done <fileB

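I set the record separator (RS) to match a C either at the very start of the input or right after a newline; otherwise we might fail to match anything in the first record correctly. The awk variable q holds the value read from fileB, and the first field of each record is matched against that value.

Result: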

C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
D      NT05HA_1310 protein-export membrane protein SecF
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein
