Merging multiple tab-delimited tables based on a common column

Answer 1

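The join utility from GNU coreutils does almost what you want. I could not find a way to make it handle lines missing from either file in a single pass, but you can run it twice and merge the results: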

sort -u \
<(join --header --nocheck-order -t$'\t' -a1 -o 1.1,1.2,2.2 -11 -21 -e'-' file1 file2) \
<(join --header --nocheck-order -t$'\t' -a2 -o 2.1,1.2,2.2 -11 -21 -e'-' file1 file2)
100001  C       C
228201  T       -
312002  -       C
341791  T       T
380441  C       C
392640  T       -
412640  -       A
459055  A       A
459079  T       T
464056  -       T
480253  T       -
492633  A       A
570405  T       T
Position        Poly    Poly

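The header row gets sorted to the bottom. If that is a problem, you can remove it by piping the output through sed '$d' or head -n -1. Alternatively, if unsorted output is acceptable, you can use awk instead of sort -u to remove the duplicates, i.e.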

awk '!a[$1]++' \
<(join --header --nocheck-order -t$'\t' -a1 -o 1.1,1.2,2.2 -11 -21 -e'-' file1 file2) \
<(join --header --nocheck-order -t$'\t' -a2 -o 2.1,1.2,2.2 -11 -21 -e'-' file1 file2)
Position        Poly    Poly
100001  C       C
228201  T       -
341791  T       T
380441  C       C
392640  T       -
459055  A       A
459079  T       T
480253  T       -
570405  T       T
492633  A       A
312002  -       C
412640  -       A
464056  -       T
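
The awk expression '!a[$1]++' keeps a line only the first time its first field is seen, which is why the rows keep their arrival order instead of being sorted. As a rough illustration of that idiom (not part of the original answer), here is a minimal Python sketch; the function name first_per_key and the toy rows are assumptions made up for the example.

```python
def first_per_key(lines):
    """Keep only the first line seen for each value of the first field."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.split()  # whitespace splitting, like awk's default FS
        key = fields[0] if fields else ""
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Toy input: rows from the "-a1" pass followed by rows from the "-a2" pass.
left_pass = ["Position\tPoly\tPoly", "100001\tC\tC", "228201\tT\t-"]
right_pass = ["Position\tPoly\tPoly", "100001\tC\tC", "312002\t-\tC"]

for row in first_per_key(left_pass + right_pass):
    print(row)
```

Because the header is the first line of the first pass, it stays at the top, and rows found only in the second file end up after all rows from the first pass.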

Answer 2

Here is a gawk solution that reads both files and does not care about the order of the output.

$ awk '
    FNR == NR {
        if (FNR == 1) {header = "Position "FILENAME;next}
        a[$1] = $2;
        next;
    }
    {
        if (FNR == 1) {header = header" "FILENAME;next}
        if ($1 in a) {
            a[$1] = a[$1]" "$2;
        }
        else {
            a[$1] = "- "$2;
        }
    }
    END {
        print header;
        for (i in a) {
            print i,length(a[i]) == 1 ? a[i]" -" : a[i];
        }
    }
' file1 file2
Position file1 file2
412640 - A
380441 C C
392640 T -
570405 T T
341791 T T
459079 T T
464056 - T
228201 T -
312002 - C
100001 C C
480253 T -
492633 A A
459055 A A
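
The for (i in a) loop in awk visits keys in an unspecified order, which is why the rows above come out shuffled. If ordered output is wanted, the same associative-array idea can be sketched in Python 3 with the keys sorted numerically. The function name merge_sorted and the sample rows are made up for illustration and are not part of the answer above.

```python
def merge_sorted(rows1, rows2, missing="-"):
    """Full outer join of two (position, value) lists, sorted by position."""
    first = dict(rows1)
    second = dict(rows2)
    merged = []
    # The union of both key sets covers positions present in either table.
    for pos in sorted(set(first) | set(second), key=int):
        merged.append((pos, first.get(pos, missing), second.get(pos, missing)))
    return merged


# Hypothetical sample rows (header already stripped).
rows1 = [("100001", "C"), ("228201", "T"), ("341791", "T")]
rows2 = [("100001", "C"), ("312002", "C"), ("341791", "T")]

for row in merge_sorted(rows1, rows2):
    print(" ".join(row))
```

Looking each position up with get() and a default also avoids the length(a[i]) == 1 check, which only works while every value is a single character.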

Answer 3

If you are fine with Python and the order of the output does not matter, the following works:

# Read each file into parallel lists of positions and values.
pos_1, poly_1 = [], []
pos_2, poly_2 = [], []
with open("path to file1") as f:
    for line in f:
        a, b = line.split()  # split on any whitespace (tabs or spaces)
        pos_1.append(a)
        poly_1.append(b)
with open("path to file2") as g:
    for line in g:
        a, b = line.split()
        pos_2.append(a)
        poly_2.append(b)
result = []
for pos in pos_1 + pos_2:
    # Default to "-" when a position is missing from one of the files.
    val1 = "-"
    val2 = "-"
    if pos in pos_1:
        val1 = poly_1[pos_1.index(pos)]
    if pos in pos_2:
        val2 = poly_2[pos_2.index(pos)]
    result.append((pos, val1, val2))
result = set(result)  # remove duplicates
for pos, val1, val2 in result:
    print(pos + " " + val1 + " " + val2)

This was written for Python 2.7. Because print is called here with a single parenthesized string, the code also runs unchanged on Python 3.
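
The answers here all merge exactly two files, while the title asks about multiple tables. As a hedged sketch of how the dictionary approach could extend to any number of files, the Python 3 snippet below outer-joins them on the first column. The function name merge_tables, the demo file names, and the assumptions that every file has one header row and two whitespace-separated columns are illustrative, not taken from the question.

```python
import os
import tempfile


def merge_tables(paths, missing="-"):
    """Outer-join any number of two-column tables on their first column."""
    tables = []
    for path in paths:
        with open(path) as handle:
            next(handle, None)  # skip the header row (assumed present)
            tables.append(
                dict(line.split()[:2] for line in handle if line.strip())
            )
    # Header: "Position" followed by one column per input file name.
    rows = [["Position"] + [os.path.basename(p) for p in paths]]
    for pos in sorted(set().union(*tables), key=int):
        rows.append([pos] + [table.get(pos, missing) for table in tables])
    return rows


# Demo with three small made-up tables written to a temporary directory.
contents = [
    "Position\tPoly\n100001\tC\n228201\tT\n",
    "Position\tPoly\n100001\tC\n312002\tC\n",
    "Position\tPoly\n228201\tG\n",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for number, text in enumerate(contents, start=1):
        demo_path = os.path.join(tmp, "file%d" % number)
        with open(demo_path, "w") as handle:
            handle.write(text)
        demo_paths.append(demo_path)
    for row in merge_tables(demo_paths):
        print("\t".join(row))
```

Keeping one dictionary per file means each extra table only adds one more column lookup per position, rather than another pass over parallel lists.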
