awk의 while문

awk의 while문

다음과 같은 입력이 있습니다.

cat moldata
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_2
CCAAGGTCATCCATGACAACTTTGAGATCATTGAAGGCCTGATGACCACTGTACACGCCACCACCGCTACTCAGAAGA
GTCGACGGACCTTCCGGTAAACTCTGGCGTGATGGTCGTGGCGCTCAACAAAACATCATTCCCGCCTCTACTGGTGCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_4
GGTCCTTTGAAGGGTATTCTTGGATACACCGAAGATCAAGTTGTGTCCACCGACTTTGTTGGAGACACACACTCTTCA
CTTTGACGCTGCTGCTGGTATCTCCCTCAACGATAACTTCGTCAAACTTATCAGCTGGTACGACAATGAATATGGATA

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

여기에 나열된 종에 대해 몰다타에서 종 기록을 검색하고 싶습니다(종 이름 + 다음 종 기록까지 그 뒤의 줄 블록).

cat species_list
species_1
species_3
species_5

다음 출력을 얻습니다.

cat output
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

나는 awkwhile 루프에서 그것을 하려고 합니다:

while read line; 
do 
    if grep -q "$line" moldata; 
    then echo $line |  awk -v line=${line} 'BEGIN {RS=">"} /line/ {print $0}' moldata >> output; 
    else echo "$line not found"; 
    fi; 
done < species_list

getline이 옵션 에 대해 읽었 awk지만 작동시킬 수 없습니다.

답변1

이 작업을 고집하는 경우 awk:

( echo -n '/^>('; paste -sd\| - <species_list | tr -d '\n'; echo -n ')/,/^$/' ) | \
    awk -f - moldata

답변2

단일 레코드의 경우 다음을 수행할 수 있습니다.

awk '/species_1/{print;while (getline line){if(line !~/species/) print line; else break} }' input.txt                  
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

여러 프로젝트의 경우 다음 지침을 따르는 것이 좋습니다.

$ while IFS= read -r line                                                                                                
> do
>   awk -v spec="$line" '$0~spec{print;while (getline line){if(line !~/species/) print line; else break} }' input.txt    
> done < species.txt                                                                                                     
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

두 개의 파일을 매개변수로 사용하는 대체 솔루션

OP 예제의 데이터가 항상 종 이름 뒤에 두 줄이라는 것을 알고 있으면 종 이름을 species.txt배열에 로드한 다음 for 루프를 사용하여 한 줄을 두 번 읽고 인쇄할 수 있습니다.

$ awk 'FNR==NR{species[$0]}; NR!=FNR{ sub(/>/,"");if ($0 in species){ print $0; for(i=0;i<=1;i++) {getline data;print data}  }}' species.txt input.txt
species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT
species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG
species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

답변3

declare -a list
readarray list < /path/to/species_list
for species in ${list[@]}; do
   sed -ne "/$species/,/^\$/p" moldata
done

일반적으로 사람들은 인라인 스크립트를 작은따옴표로 묶지 sed만 우리는 데이터를 찾기 위해 쉘 변수를 사용하므로 $쉘이 데이터를 구문 분석하지 못하도록 이스케이프 처리해야 합니다.

답변4

$ awk '
    NR==FNR{ sp[$0]; next }        # this is for reading species data into sp array
    /species/ {                    # from this point for moldata's lines
        if (substr($0,2) in sp) {  # if the read line match with data in sp array
            print                  # first print title
            while(getline > 0) {   # and then read until blank line and print
                if (NF == 0) { print ""; break }
                print 
            } 
        }
    }
' species_list moldata 
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

관련 정보