How do I join files by the required columns in Linux?

Answer 1

Use GNU awk. I put the command into a bash script, which is more convenient.

Usage: ./join_files.sh, or for pretty-printed output: ./join_files.sh | column -t

#!/bin/bash

gawk '
NR == 1 {
    PROCINFO["sorted_in"] = "@ind_num_asc";
    header = $1;
}

FNR == 1 {
    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME); 
    header = header OFS file;   
}

FNR > 1 {
    arr[$1] = arr[$1] OFS $5;
}

END {
    print header;

    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results

Output (for testing, I created 3 files with identical content):

$ ./join_files.sh | column -t
gene_id          TB1      TB2      TB3
ENSG00000000003  1.00     1.00     1.00
ENSG00000000005  0.00     0.00     0.00
ENSG00000000419  1865.00  1865.00  1865.00
ENSG00000000457  1521.00  1521.00  1521.00
ENSG00000000460  1860.00  1860.00  1860.00
ENSG00000000938  6846.00  6846.00  6846.00
ENSG00000000971  0.00     0.00     0.00
ENSG00000001036  1358.00  1358.00  1358.00
ENSG00000001084  1178.00  1178.00  1178.00
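
To reproduce a test like this without real data, something along these lines should work. This is only a sketch: the directory is a temporary one, the column names other than gene_id and expected_count and all values other than the two counts are placeholders I made up, and plain sort stands in for the PROCINFO["sorted_in"] setting so that it runs with any awk, not just gawk. The header row is left out to keep it short.

```shell
#!/bin/bash
# Build three identical sample files in a throwaway directory.
# Column names and values other than gene_id/expected_count are placeholders.
workdir=$(mktemp -d)
mkdir "$workdir/results"
for sample in TB1 TB2 TB3; do
    {
        printf 'gene_id\ttranscript_id\tlength\teffective_length\texpected_count\n'
        printf 'ENSG00000000003\tT1\t100\t80\t1.00\n'
        printf 'ENSG00000000005\tT2\t200\t180\t0.00\n'
    } > "$workdir/results/$sample.genes.results"
done

# Collect column 5 per gene_id across all files (data rows only),
# then sort the joined rows by gene_id.
joined=$(awk 'FNR > 1 { arr[$1] = arr[$1] OFS $5 }
              END { for (id in arr) print id arr[id] }' \
    "$workdir"/results/*.genes.results | sort)

printf '%s\n' "$joined"
rm -rf "$workdir"
```

Each gene_id ends up on one line followed by one value per input file, in the order the shell expands the glob.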

Explanation: the same code with comments added. Also have a look at man gawk.

gawk '
# NR is the total number of input records (lines) read so far, across all files.
# This block runs only for the very first line of the first file.

NR == 1 {
    # If PROCINFO["sorted_in"] is set, its value controls the order in which
    # a "for (i in arr)" loop visits the array elements; otherwise the order
    # is undefined.
    # Note: the gene IDs are strings, so "@ind_str_asc" would state the
    # intent more directly.

    PROCINFO["sorted_in"] = "@ind_num_asc";

    # Each field of an input record can be referenced by position: $1, $2, and so on.
    # $1 is the first field (column). On the first line it is the word "gene_id",
    # so store it as the start of the header.

    header = $1;
}

# FNR is the record number within the current input file.
# NR counts lines across all files; FNR restarts at 1 for each file.
# FNR == 1 matches the first line of every file.

FNR == 1 {
    # Strip the unneeded parts of the file name with gensub():
    # before: results/TB1.genes.results
    # after:  TB1

    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME);

    # Append it to the header, using OFS as the delimiter.
    # OFS is the output field separator, a single space by default.

    header = header OFS file;
}

# The aggregation happens here.
# $1 is the first column value, "gene_id".
# $5 is the fifth column value, "expected_count".
FNR > 1 {
    # Build an array indexed by gene_id: arr["ENSG00000000003"],
    # arr["ENSG00000000419"], and so on. Every time a line with a given
    # gene_id is read, its $5 value is appended to that array element,
    # separated by OFS.

    # Example:
    # arr["ENSG00000000003"] = 1.00
    # arr["ENSG00000000003"] = 1.00 2.00
    # arr["ENSG00000000003"] = 1.00 2.00 3.00

    arr[$1] = arr[$1] OFS $5;
}

END {
    print header;

    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results
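
The difference between NR and FNR is the part that tends to confuse people, so here is a small sketch that prints both counters. The two files and their contents are made up for the demonstration; nothing here is specific to gawk.

```shell
#!/bin/bash
# Two tiny two-line files (hypothetical content) in a temporary directory.
tmp=$(mktemp -d)
printf 'gene_id\nA\n' > "$tmp/one.txt"
printf 'gene_id\nB\n' > "$tmp/two.txt"

# Print both counters for every line: NR keeps counting across files,
# FNR starts again at 1 for each new file.
counters=$(awk '{ print NR, FNR }' "$tmp/one.txt" "$tmp/two.txt")

printf '%s\n' "$counters"
rm -rf "$tmp"
```

The third line printed is "3 1": it is the third line overall but the first line of the second file, which is why FNR == 1 catches every header line while NR == 1 catches only the first one.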

Answer 2

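If I understand your question correctly, you want to know how to handle the case where you need to output many columns. The cut command you are already using understands column ranges. For example, to output columns 1 and 5, columns 7 through 13, and every column from 17 to the end, use: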

cut -f1,5,7-13,17-
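
To see how the list and range syntax behaves, here is a small sketch on a single made-up record. The field names f1 to f6 are placeholders; cut splits on tabs by default.

```shell
#!/bin/bash
# A single tab-separated record with six hypothetical fields.
row=$(printf 'f1\tf2\tf3\tf4\tf5\tf6')

# Keep field 1, fields 3 through 4, and everything from field 6 onward.
picked=$(printf '%s\n' "$row" | cut -f1,3-4,6-)

printf '%s\n' "$picked"
```

The selected fields always come out in their original order, whatever order they are listed in after -f.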

Alternatively, cut can exclude specific fields. For example, to exclude field number 5:

cut --complement -f5

As far as I can tell, what you want is to remove the second column, Transcript_id, so I would use:

cut --complement -f2
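
A quick way to check this on a made-up header line is sketched below. It assumes GNU cut, since --complement is a GNU extension, and the third column name is a placeholder.

```shell
#!/bin/bash
# A tab-separated header line; "length" is a placeholder column name.
row=$(printf 'gene_id\tTranscript_id\tlength')

# Drop field 2 and keep everything else (requires GNU cut).
without_second=$(printf '%s\n' "$row" | cut --complement -f2)

printf '%s\n' "$without_second"
```

Only gene_id and length remain, still separated by a tab.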

P.S. The data you provided does not match your script. It looks like you simplified it and removed some columns.
