How to compare two files on common columns and get an output file containing columns from each file

Question 1

Bioinformatics seems like fun. If you're open to non-awk solutions, this is easy with Miller (`mlr`):

```
mlr --itsv join -u -j chrom,pos --lp tr_ --rp untr_ -f treated.bam.tsv untreated.bam.tsv | # join data from treated and untreated files by fields chrom and pos
mlr put '$tr_pct=($tr_mismatches+$tr_deletions+$tr_insertions)/$tr_reads_all' | # calculate pct for treated data
mlr put '$untr_pct=($untr_mismatches+$untr_deletions+$untr_insertions)/$untr_reads_all' | # calculate pct for untreated data
mlr cut -o -f chrom,pos,tr_ref,tr_reads_all,tr_mismatches,tr_deletions,tr_insertions,tr_pct,untr_ref,untr_reads_all,untr_mismatches,untr_deletions,untr_insertions,untr_pct | # remove superfluous fields
mlr --otsv put '$pct_sub=$tr_pct-$untr_pct' # append pct subtraction field
```

```
chrom   pos tr_ref  tr_reads_all    tr_mismatches   tr_deletions    tr_insertions   tr_pct  untr_ref    untr_reads_all  untr_mismatches untr_deletions  untr_insertions untr_pct    pct_sub
chrY    59363551    G   8   0   1   5   0.750000    G   2   0   0   1   0.500000    0.250000
chrY    59363552    G   7   0   0   0   0   G   1   0   0   0   0   0
chrY    59363553    T   7   0   0   0   0   T   1   0   0   0   0   0
chrY    59363554    G   7   0   0   0   0   G   1   0   0   0   0   0
chrY    59363555    T   7   0   0   0   0   T   1   0   0   0   0   0
```

It looks scarier than it really is. Honest.

Question 2

`if ( $1 $2 in array )` doesn't work; you need to write `if (($1,$2) in array)`.
You can't use `array[$3]` and `array[$4]` like that. Your array looks like
```
array[chrY,59363551]="chrY 59363551 G 8 0 7 0 0 0 1 0 5 0 0 0 0 0 0 0 7 0 0 0"
array[chrY,59363552]="chrY 59363552 G 7 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0"
             ︙
```
When you say `array[$3]` and `array[$4]`, that means `array[G]`, `array[2]`, and so on, which don't exist.
The ability to specify `> "filename"` inside the awk code is useful when you want to write to multiple files. It isn't very useful when you have only a single output file. Why not just redirect the output of the awk command?
Long lines are bad. Break long commands into short lines, and reuse variables to reduce duplication.
Don't use an array called `array`. That's like having a variable called `variable`, a file called `file`, a person called `Person`, and so on. Use descriptive names.
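
To see why the comma form matters, here is a minimal sketch. It assumes the array was filled with a two-part subscript such as `[$1,$2]`; the array name `seen` and the one-line sample input are made up for illustration. awk joins the parts of a comma subscript with `SUBSEP`, while `$1 $2` is plain string concatenation, so the two forms look up different keys.

```shell
# Store one line under the two-part key (chrom, pos), then test both lookups.
result=$(printf 'chrY 59363551 G 8\n' | awk '
    { seen[$1,$2] = $0 }        # key is $1 SUBSEP $2
    END {
        chrom = "chrY"; pos = "59363551"
        # Comma form: builds the same SUBSEP-joined key, so it matches.
        comma  = ((chrom, pos) in seen) ? "found" : "missing"
        # Concatenation form: looks up "chrY59363551", which was never stored.
        concat = ((chrom pos) in seen) ? "found" : "missing"
        print "comma form: " comma
        print "concat form: " concat
    }')
printf '%s\n' "$result"
```

The first line reports `found` and the second `missing`, which is the failure the original test ran into.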

In other words,

```
awk 'FNR==NR {file1data[$1,$2]=$0; next}
        {       if (($1,$2) in file1data) {
                        # Save desired values from file2.
                        file2arg03=$3
                        file2arg04=$4
                        file2arg08=$8
                        file2arg10=$10
                        file2arg12=$12
                        pct_file2=($8+$10+$12)/$4
                        # Get data from file1.
                        $0=file1data[$1,$2]
                        pct_file1=($8+$10+$12)/$4
                        print $1, $2, $3, $4, $8, $10, $12, pct_file1, \
                                file2arg03, file2arg04, file2arg08, file2arg10, file2arg12, \
                                pct_file2, pct_file1-pct_file2
                } else printf "(%s,%s) in file2 but not file1.%s", $1, $2, ORS
        }' treated.bam.tsv untreated.bam.tsv > awkoutput.bam.tsv
```

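Like your version, this stores the file1 data in an array and then does all the work and output while reading file2. After reading a line from file2, it saves the fields it needs from that line in named variables (you could also use another array, five elements long). It then recovers the data from the corresponding line of file1. Assigning an entire line to `$0` makes awk split it into fields again, so `$1`, `$2`, `$3`, `$4`, etc. go back to the values they had when that line was read from file1.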
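
The `$0` reassignment can be seen in isolation with a small sketch. The input line, the stored line, and the variable name `saved_ref` are hypothetical stand-ins, not taken from the real data files.

```shell
# Save a field from the current record, then overwrite $0 with a stored line.
result=$(printf 'chrY 100 G 2\n' | awk '
    {
        saved_ref = $3          # copy a field from the current (file2-style) record
        $0 = "chrY 100 T 8"     # hypothetical stored line; awk re-splits it into $1..$NF
        print saved_ref, $3, $4, NF
    }')
printf '%s\n' "$result"
```

This prints `G T 8 4`: the saved variable keeps the old value, while `$3`, `$4`, and `NF` now describe the newly assigned line. That is why the file2 values have to be copied out before `$0` is replaced.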

Is writing a header row to the output really a problem? Try:

```
        {       if (FNR == 1) {
                        print "chrom pos ref reads_all mismatches deletions insertions pct_file1 …"
                } else if (($1,$2) in file1data ) {
                        file2arg03=$3
                              ︙
```

OK, here's a version that is closer to your attempt and handles the header line.

```
awk 'FNR==NR {file1line[$1,$2]=$0; next}
        {       if (FNR == 1) {
                        print "chrom pos ref reads_all mismatches deletions insertions pct_file1 ref reads_all mismatches deletions insertions pct_file2 pct_sub …"
                } else if (($1,$2) in file1line ) {
                        # Get data from file1.
                        split(file1line[$1,$2], file1arg)
                        pct_file1=(file1arg[8]+file1arg[10]+file1arg[12])/file1arg[4]
                        pct_file2=($8+$10+$12)/$4
                        print $1, $2, file1arg[3], file1arg[4], file1arg[8], \
                                file1arg[10], file1arg[12], pct_file1, \
                                $3, $4, $8, $10, $12, pct_file2, pct_file1-pct_file2
                } else printf "(%s,%s) in file2 but not file1.%s", $1, $2, ORS
        }' treated.bam.tsv untreated.bam.tsv > awkoutput.bam.tsv
```

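This retrieves the line from file1 (from `file1line`) and passes it to `split`, which breaks it into its 23 component values and stores them in the array `file1arg`. You can then use `file1arg[3]`, `file1arg[4]`, ... the way you were trying to use `array[$3]`, `array[$4]`.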
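
As a rough illustration of what `split` does here, the sketch below uses a shortened, made-up 12-field record in place of a real 23-field line; the names `line`, `parts`, and `n` are placeholders. With no third argument, `split` separates on `FS` (whitespace by default), numbers the elements from 1, and returns the element count.

```shell
# Split a stored line into a numbered array and compute the same percentage.
result=$(awk 'BEGIN {
    line = "chrY 59363551 G 8 0 7 0 0 0 1 0 5"   # shortened, hypothetical record
    n = split(line, parts)                        # default separator: FS (whitespace)
    pct = (parts[8] + parts[10] + parts[12]) / parts[4]
    print n, parts[3], parts[4], pct
}')
printf '%s\n' "$result"
```

This prints `12 G 8 0.75`. The element numbers line up with the field numbers of the original line, so `parts[4]` plays the role that `$4` would have played if the line were the current record.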
