I have a GFF file with 7 columns that describes chromosomal regions. I want to collapse the rows with REGION="exon" into a single row. Rows should be collapsed when their regions overlap each other.
REGION START END SCORE STRAND FRAME ATTRIBUTE
exon 26453 26644 . + . Transcript "XM_092971"; Name "XM_092971"
exon 26842 27020 . + . Transcript "XM_092971"; Name "XM_092971"
exon 30355 30899 . - . Transcript "XM_104663"; Name "XM_104663"
GS_TRAN 30355 34083 . - . GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
snp 30847 30847 . + . SNP "rs2971719"; Name "rs2971719"
exon 31012 31409 . - . Transcript "XM_104663"; Name "XM_104663"
exon 34013 34083 . - . Transcript "XM_104663"; Name "XM_104663"
exon 40932 41071 . + . Transcript "XM_092971"; Name "XM_092971"
snp 44269 44269 . + . SNP "rs2873227"; Name "rs2873227"
snp 45723 45723 . + . SNP "rs2227095"; Name "rs2227095"
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
exon 134034 134457 . - . Transcript "XM_086914"; Name "XM_086914"
In the sample data above, only the last two rows can be merged into one row. So the new row would be:
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
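If the END of the other row had been greater than the END of the row before it, that larger value would become the END of the merged region. Basically, whenever rows overlap, take the START of the row that starts first and the END of the row that ends last.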
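The rule just described (earliest start, latest end over a run of overlapping rows) can be sketched in a few lines of Python. This is only an illustration of the rule, not a full solution: the function name `merge_intervals` is made up for this example, coordinates are assumed to be inclusive, and strand and attributes are ignored.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs; coordinates are inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlap: keep the earlier start, extend to the later end.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # No overlap with the previous run: start a new one.
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


# Exon coordinates taken from the sample data above.
exons = [(26453, 26644), (26842, 27020), (134031, 134495), (134034, 134457)]
print(merge_intervals(exons))
```

Sorting first means the input does not have to be ordered by START, and it also covers runs of more than two overlapping rows.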
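There can be many such instances in the file; here it is only the last two rows. One caveat: the ATTRIBUTE column is mostly identical between rows, but overlapping rows can definitely carry different transcript names.
Any suggestions on how to proceed?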
Updated example: if the last two rows are
exon 134031 134457 . - . Transcript "XM_086913"; Name "XM_086913"
exon 134034 134495 . - . Transcript "XM_086914"; Name "XM_086914"
then the output should be
exon 134031 134495 . - . Transcript "XM_086913"; Transcript "XM_086914"
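Basically, the merged row starts where the first row starts and ends where the second row ends, because I want a single row to cover the overlapping region instead of two, three, or more rows. Here the overlap is between two rows, but it could involve more than two.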
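One way to get the output requested in the updated example is sketched below in Python. It rests on several assumptions that are not stated in the question: overlapping exon rows sit on consecutive lines, the file is sorted by START, rows that are not merged are passed through unchanged, and a merged row keeps the score, strand and frame of its first row. The function name `collapse_exons` is made up for this sketch.

```python
import re

TRANSCRIPT = re.compile(r'Transcript "([^"]+)"')


def collapse_exons(lines):
    """Collapse consecutive overlapping exon rows of the 7-column file."""
    out = []
    pending = None  # the exon run currently being built, if any

    def flush():
        nonlocal pending
        if pending is None:
            return
        if pending["merged"]:
            # Rebuild the attribute from every transcript name seen in the run.
            attribute = "; ".join(f'Transcript "{n}"' for n in pending["names"])
            fields = ["exon", str(pending["start"]), str(pending["end"]),
                      *pending["rest"], attribute]
            out.append(" ".join(fields))
        else:
            out.append(pending["line"])  # untouched single exon
        pending = None

    for line in lines:
        line = line.rstrip("\n")
        region, start, end, score, strand, frame, attribute = line.split(None, 6)
        if region != "exon":
            flush()
            out.append(line)  # header and non-exon rows pass through
            continue
        start, end = int(start), int(end)
        names = TRANSCRIPT.findall(attribute)
        if pending is not None and start <= pending["end"]:
            pending["end"] = max(pending["end"], end)
            pending["names"] += [n for n in names if n not in pending["names"]]
            pending["merged"] = True
        else:
            flush()
            pending = {"line": line, "start": start, "end": end,
                       "rest": [score, strand, frame],
                       "names": names, "merged": False}
    flush()
    return out


rows = [
    'exon 134031 134457 . - . Transcript "XM_086913"; Name "XM_086913"',
    'exon 134034 134495 . - . Transcript "XM_086914"; Name "XM_086914"',
]
print("\n".join(collapse_exons(rows)))
```

Because a non-exon row ends the current run, two exons separated by, say, an snp row are never merged in this sketch, even if they overlap.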
Updated example (24 March 2014)
chr1 HAVANA stop_codon 1120520 1120522 . + 0 gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1115077 1115233 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1115414 1115433 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1120520 1121244 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA transcript 1115864 1119307 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA exon 1115864 1116240 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA *exon 1117121 1117195* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA *exon 1117150 1117826* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA exon 1118256 1118427 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA transcript 1190648 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1209046 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1203113 1203372 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA CDS 1203241 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA start_codon 1203370 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA stop_codon 1203238 1203240 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1198726 1198766 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1192588 1192690 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1192372 1192510 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA *exon 1191425 1191505* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA *exon 1190648 1191470* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
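The top half shows an overlap on the "+" strand and the bottom half shows an overlap on the "-" strand. On the "-" strand the coordinates are listed in decreasing order, so the overlap is the one shown in the last two rows. The two halves belong to different genes. The merging has to be done per gene, because exons of different genes can occasionally overlap as well, although from what I have read in some posts this is very rare. The gene information can be extracted from the last column, where it appears as "gene_name".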
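Pulling "gene_name" (and "transcript_id") out of the last column can be done with a small regular expression. The Python sketch below is one possible way; the helper name `parse_attributes` is made up for this example. It only picks up quoted values, so unquoted entries such as `level 2;` are skipped, and if a key is repeated only its last value is kept.

```python
import re

# key "value" pairs, e.g. gene_name "TTLL10"
ATTR = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(column):
    """Return the quoted key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTR.findall(column))


column = ('gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; '
          'gene_type "protein_coding"; gene_name "TTLL10"; level 2;')
attrs = parse_attributes(column)
print(attrs["gene_name"], attrs["transcript_id"])
```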
The two rows with gene_name=TTLL10 have overlapping exons, so they are merged in the final output.
chr1 HAVANA *exon 1117121 1117195* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1 HAVANA *exon 1117150 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
The two rows with gene_name=UBE2J2 also have overlapping exons.
chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
Sample output
The remaining rows stay unchanged, and the rows above are merged per gene.
chr1 HAVANA *exon 1117121 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
If the transcript_id values differ, the gene_name stays the same but both transcript IDs should be printed. For example, suppose the transcript IDs for a gene differ like this:
chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.2"; gene_name "UBE2J2";
They are merged as above, but the result should carry both transcript names, because I think keeping the transcript information may turn out to be necessary and important later.
chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; "ENST00000473215.2" gene_name "UBE2J2";
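The per-gene merge for the 9-column data could be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the function name `merge_gtf_exons` is invented, only exon rows are returned (the other rows would have to be written out separately), the `*` highlight markers are assumed to be absent from the real file, the merged row keeps the score and frame of its leftmost exon, and the transcript IDs are written as repeated `transcript_id "..."` entries, which is one possible reading of the layout requested above.

```python
import re
from collections import defaultdict


def merge_gtf_exons(lines):
    """Merge overlapping exon rows per (chromosome, source, strand, gene_name)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue  # this sketch only looks at exon rows
        chrom, source, _, start, end, score, strand, frame, attrs = fields
        gene = re.search(r'gene_name "([^"]+)"', attrs).group(1)
        transcript = re.search(r'transcript_id "([^"]+)"', attrs).group(1)
        groups[(chrom, source, strand, gene)].append(
            (int(start), int(end), score, frame, transcript))

    merged = []
    for (chrom, source, strand, gene), exons in groups.items():
        runs = []
        # Sorting by start makes the descending order of "-" strand rows irrelevant.
        for start, end, score, frame, transcript in sorted(exons):
            if runs and start <= runs[-1]["end"]:
                runs[-1]["end"] = max(runs[-1]["end"], end)
                runs[-1]["transcripts"].add(transcript)
            else:
                runs.append({"start": start, "end": end, "score": score,
                             "frame": frame, "transcripts": {transcript}})
        for run in runs:
            ids = " ".join(f'transcript_id "{t}";'
                           for t in sorted(run["transcripts"]))
            merged.append(f'{chrom} {source} exon {run["start"]} {run["end"]} '
                          f'{run["score"]} {strand} {run["frame"]} '
                          f'{ids} gene_name "{gene}";')
    return merged


rows = [
    'chr1 HAVANA exon 1191425 1191505 . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";',
    'chr1 HAVANA exon 1190648 1191470 . - . transcript_id "ENST00000473215.2"; gene_name "UBE2J2";',
]
print("\n".join(merge_gtf_exons(rows)))
```

Grouping by gene before merging is what keeps overlapping exons of different genes apart.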
Answer 1
An "awk" approach:
awk '
$1!="exon" {                          # If the first field is not equal to "exon"
    if (previous) print previous      # If there is a pending exon line, print it
    print                             # Print the current line
    previous=start=end=exon_seq=""    # Reset all variables to the empty string
    next                              # Move on to the next line of the input file
}
{
    if (exon_seq) {                   # If we are inside a sequence of "exon" lines
        # If the start (field 2) of the previous line is less than or equal to
        # that of the current line, and the end of the previous line is greater
        # than or equal to field 3 of the current line, the current exon lies
        # inside the previous one.
        # Note: this only catches full containment; a partial overlap is
        # still printed as two separate lines.
        if (start<=$2 && end>=$3)
            next                      # then do nothing and read the next line
        else                          # otherwise the current exon is not contained,
            print previous            # so print the previous line
    }
    else {                            # If we are not already in a sequence of
                                      # "exon" lines, then this is the first one
        exon_seq=1                    # so exon_seq becomes 1
    }
    previous=$0; start=$2; end=$3     # `start` becomes field 2, `end` becomes field 3
                                      # and `previous` becomes the current record (line)
}
END {                                 # After all lines are processed
    if (previous) print previous      # If there is still a pending line, print it
}
' file
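The condition `start<=$2 && end>=$3` in the script above is a containment test, which matches the first example in the question but not the updated one. The small Python sketch below contrasts containment with a general overlap test on inclusive coordinates; the function names `contains` and `overlaps` are made up for this illustration.

```python
def contains(prev, cur):
    """True if cur lies entirely inside prev (the test used in the awk script)."""
    return prev[0] <= cur[0] and prev[1] >= cur[1]


def overlaps(prev, cur):
    """True if the two inclusive intervals share at least one position."""
    return cur[0] <= prev[1] and prev[0] <= cur[1]


original = ((134031, 134495), (134034, 134457))  # first example in the question
updated = ((134031, 134457), (134034, 134495))   # updated example

print(contains(*original), overlaps(*original))
print(contains(*updated), overlaps(*updated))
```

For the updated example the containment test fails while the overlap test succeeds, so handling that case would mean testing for overlap and extending `end` to the larger of the two values.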
Answer 2
I would use Perl for a task this complex. Here is a partial solution; you may need to tweak it to make it work better.
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ max };

# Print a stored exon as one space-separated line.
sub output {
    my $previous = shift;
    print join ' ', 'exon',
        @{$previous}{qw(start end score strand frame attribute)};
}

$\ = "\n";      # Append a newline to everything printed.
my %previous;   # The exon currently being built; empty if there is none.

while (<>) {
    chomp;
    my ($region, $start, $end, $score, $strand, $frame, $attribute)
        = split ' ', $_, 7;
    if ($. == 1) {
        print;  # Pass the header line through.
    } elsif ('exon' eq $region) {
        if (keys %previous
            and $start < $previous{end}      # Overlap.
           ) {
            if ($end > $previous{end}) {     # Not contained.
                # Drop the Name part and append the new Transcript.
                $previous{attribute} =~ s/; Name .*//;
                $previous{attribute} .= '; '
                    . ($attribute =~ /(Transcript ".*?")/)[0];
                $previous{end} = $end;
            }
        } else {
            # No overlap: print the stored exon and start a new one.
            if (keys %previous) {
                output(\%previous);
            }
            %previous = ( start     => $start,
                          end       => $end,
                          score     => $score,
                          strand    => $strand,
                          frame     => $frame,
                          attribute => $attribute,
                        );
        }
    } else {
        # Non-exon line: print the stored exon, if any
        # (the non-exon line itself is not printed).
        output(\%previous) if keys %previous;
        %previous = ();
    }
}
output(\%previous) if keys %previous;