Parsing two files line by line by nesting "awk" in a "while" loop and comparing column values

Answer 1

The first problem is that you can't use bash variables inside awk like that. Inside awk, $a evaluates to field number a, but a is empty there because it is defined in bash, not in awk. One way around this is to use awk's -v option to define the variable:

-v var=val
--assign var=val
   Assign the value val to the variable var,  before  execution  of
   the  program  begins.  Such variable values are available to the
   BEGIN rule of an AWK program.
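
As a small illustration of the difference, here is a minimal sketch. The function names without_v and with_v, the value of a, and the input line are made up for this example.

```shell
a=2   # hypothetical shell variable

# Inside single quotes the shell never expands $a. awk reads it as
# "field number a"; a is unset inside awk, so $a is $0, the whole line.
without_v() { echo 'x y z' | awk '{print $a}'; }

# -v copies the shell value into an awk variable, used as plain a;
# $a then means field number 2.
with_v() { echo 'x y z' | awk -v a="$a" '{print a, $a}'; }

without_v   # x y z
with_v      # 2 y
```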

So, you could do:

while read chr a b cov; do 
  awk -v a="$a" -v b="$b" '($2<=a && b <= $3) {print NR}' exons.bed > out$a$b 
done < reads.bed

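But you have another error there. For a read to fall within an exon, the read's start position must be greater than or equal to the exon's start position, and its end position less than or equal to the exon's end position. The test $2<=a && b<=$3 does the opposite when a and b hold the exon's bounds and $2 and $3 the read's: it selects reads that start outside the exon's boundaries. With the fields in those roles, what you want is $2>=a && $3<=b. Which file supplies a and b decides which way the comparison must point.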
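
To make the direction of the comparison concrete, here is a small sketch with invented coordinates. The function name in_exon and its argument order are hypothetical; a and b are the exon's bounds, s and e the read's.

```shell
# in_exon EXON_START EXON_END READ_START READ_END
# Prints "inside" when the read lies within the exon, else "outside".
in_exon() {
    awk -v a="$1" -v b="$2" -v s="$3" -v e="$4" \
        'BEGIN { print ((s >= a && e <= b) ? "inside" : "outside") }'
}

in_exon 60005 60100 60010 60050   # inside
in_exon 60005 60100 60000 60050   # outside: read starts before the exon
in_exon 60005 60100 60050 60120   # outside: read ends after the exon
```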

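Anyway, running this sort of thing in a bash loop is very inefficient, since awk has to read its input file again for every pair of a and b, that is, once per line of the file driving the loop. Why not do the whole thing in awk?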

awk 'NR==FNR{a[NR]=$2;b[NR]=$3; next} {
        for (i in a){
           if($2>=a[i] && $3<=b[i]){
            out[i]=out[i]" "FNR 
        }}}
        END{for (i in out){
                   print "Exon",i,"contains reads of line(s)"out[i],\
                   "of reads file" 
        }}' exons.bed reads.bed

Running the script above on your sample files produces the following output:

Exon 1 contains reads of line(s) 1 of reads file
Exon 2 contains reads of line(s) 2 3 4 5 of reads file

For clarity, here is the same thing in a less condensed form:

#!/usr/bin/awk -f

## While we're reading the 1st file, exons.bed
NR==FNR{
    ## Save the start position in array a and the end 
    ## in array b. The keys of the arrays are the line numbers.
    a[NR]=$2;
    b[NR]=$3; 
    ## Move to the next line, without continuing
    ## the script.
    next;
}
 ## Once we move on to the 2nd file, reads.bed
 {
     ## For each set of start and end positions
     for (i in a){
         ## If the current line's 2nd field (read start) is not less than
         ## this start position and its 3rd field (read end) is not greater
         ## than this end position,
         ## add this line number (FNR is the current file's line number)
         ## to the list of reads for the current value of i. 
         if($2>=a[i] && $3<=b[i]){
             out[i]=out[i]" "FNR 
         }
     }
 }
 ## After both files have been processed
 END{
     ## For each exon in the out array
     for (i in out){
         ## Print the exon number and the reads it contains
         print "Exon",i,"contains reads of line(s)"out[i],
             "of reads file" 
     }
 }
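
One detail worth knowing: awk does not guarantee the order in which for (i in out) visits its keys, so with many exons the report may not come out in file order. A possible variant, sketched here on the assumption that exon numbers are simply the line numbers 1 to n of exons.bed, remembers n and loops over that range instead. The function name report and the sample coordinates in the usage lines are invented for illustration.

```shell
# report EXONS_FILE READS_FILE
# Same containment test as above, but the END block walks the exon line
# numbers 1..n in order instead of relying on `for (i in out)`.
report() {
    awk 'NR == FNR { a[NR] = $2; b[NR] = $3; n = NR; next }
         {
             for (i = 1; i <= n; i++) {
                 if ($2 >= a[i] && $3 <= b[i]) {
                     out[i] = out[i] " " FNR
                 }
             }
         }
         END {
             for (i = 1; i <= n; i++) {
                 if (i in out) {
                     print "Exon", i, "contains reads of line(s)" out[i], "of reads file"
                 }
             }
         }' "$1" "$2"
}

# Hypothetical input, written to a temporary directory.
tmp=$(mktemp -d)
printf 'chr1 60005 60100\nchr1 61007 61130\n' > "$tmp/exons.bed"
printf 'chr1 60010 60050 3\nchr1 61010 61020 7\nchr1 70000 70010 1\n' > "$tmp/reads.bed"
report "$tmp/exons.bed" "$tmp/reads.bed"
rm -r "$tmp"
```

With this input the first read falls in exon 1, the second in exon 2, and the third in neither, so exons without reads are simply not listed.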

Answer 2

I know this isn't quite what you asked for, but personally I'm not much of an awk person, so I'd suggest giving Perl a try.

Something like this:

#!/usr/bin/perl

#REALLY GOOD IDEA at the start of any perl code
use strict;
use warnings;

#open some files for input
open( my $exons, "<", 'exons.bed' ) or die $!;

#record where our exons start and finish. 
my %start_of;
my %end_of;

#read line by line our exons file. 
#extract the 3 fields and save 'start' and 'end' in a hash table. 
while (<$exons>) {
    my ( $something, $start, $end ) = split;

    my $exon_id = $.;    #line number;
    $start_of{$exon_id} = $start;
    $end_of{$exon_id}   = $end;
}
close ( $exons );

my %exons;
#run through 'reads' line by line, extracting the fields. 

open( my $reads, "<", 'reads.bed' ) or die $!;
while (<$reads>) {
    my ( $thing, $read_start, $read_end, $value ) = split;

    #cycle through each exon. 
    foreach my $exon_id ( keys %start_of ) {

        #check if _this_ 'read' is within the start and end ranges. 
        if (    $read_start >= $start_of{$exon_id}
            and $read_end <= $end_of{$exon_id} )
        {
            #store the line number in our hash %exons. 
            push( @{ $exons{$exon_id} }, $. );
        }
    }
}
close ( $reads ); 

#cycle through %exons - in numeric 'id' order. 
foreach my $exon_id ( sort { $a <=> $b } keys %exons ) {
    #print any matches. 
    print "exon ",$exon_id, " (", $start_of{$exon_id}, " - ", $end_of{$exon_id},
        ") contains reads of line:", join( ",", @{ $exons{$exon_id} } ), "\n";
}

Given your sample data, this prints:

exon 1 (60005 - 60100) contains reads of line:1
exon 2 (61007 - 61130) contains reads of line:2,3,4,5

This should be easy to extend if you need more complicated range checking/validation!
