I have the following file:
awk -F'\t' '$3=="mRNA"' GCF_000390325.2_Ntom_v01_genomic.gff | head
NW_008828495.1 Gnomon mRNA 35293 38211 . + . ID=rna-XM_009608413.3;Parent=gene-LOC104084433;Dbxref=GeneID:104084433,Genbank:XM_009608413.3;Name=XM_009608413.3;gbkey=mRNA;gene=LOC104084433;model_evidence=Supporting evidence includes similarity to: 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 76 samples with support for all annotated introns;product=cytochrome P450 CYP82D47-like;transcript_id=XM_009608413.3
NW_008828515.1 Gnomon mRNA 6799 11530 . + . ID=rna-XM_009591409.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009591409.3;Name=XM_009591409.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 22 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X2;transcript_id=XM_009591409.3
NW_008828515.1 Gnomon mRNA 6799 11530 . + . ID=rna-XM_009630598.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009630598.3;Name=XM_009630598.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 34 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X1;transcript_id=XM_009630598.3
NW_008828528.1 Gnomon mRNA 2303 14453 . + . ID=rna-XM_033657931.1;Parent=gene-LOC117278374;Dbxref=GeneID:117278374,Genbank:XM_033657931.1;Name=XM_033657931.1;gbkey=mRNA;gene=LOC117278374;model_evidence=Supporting evidence includes similarity to: 72%25 coverage of the annotated genomic feature by RNAseq alignments;product=uncharacterized LOC117278374;transcript_id=XM_033657931.1
NW_008828528.1 Gnomon mRNA 5510 7652 . - . ID=rna-XM_033657569.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657569.1;Name=XM_033657569.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1;transcript_id=XM_033657569.1
NW_008828528.1 Gnomon mRNA 5873 8848 . - . ID=rna-XM_033657711.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657711.1;Name=XM_033657711.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2;transcript_id=XM_033657711.1
NW_008828570.1 Gnomon mRNA 5 6611 . - . ID=rna-XM_009610342.3;Parent=gene-LOC104102329;Dbxref=GeneID:104102329,Genbank:XM_009610342.3;Name=XM_009610342.3;gbkey=mRNA;gene=LOC104102329;model_evidence=Supporting evidence includes similarity to: 27 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 56 samples with support for all annotated introns;partial=true;product=TATA-box-binding protein-like;start_range=.,5;transcript_id=XM_009610342.3
NW_008828592.1 Gnomon mRNA 9998 13370 . + . ID=rna-XM_033658453.1;Parent=gene-LOC104103684;Dbxref=GeneID:104103684,Genbank:XM_033658453.1;Name=XM_033658453.1;gbkey=mRNA;gene=LOC104103684;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 10 samples with support for all annotated introns;product=pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic;transcript_id=XM_033658453.1
NW_008828592.1 Gnomon mRNA 13457 18285 . - . ID=rna-XM_009612846.3;Parent=gene-LOC104104451;Dbxref=GeneID:104104451,Genbank:XM_009612846.3;Name=XM_009612846.3;gbkey=mRNA;gene=LOC104104451;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=uncharacterized LOC104104451;transcript_id=XM_009612846.3
NW_008828641.1 Gnomon mRNA 4417 7406 . + . ID=rna-XM_009613787.3;Parent=gene-LOC104105226;Dbxref=GeneID:104105226,Genbank:XM_009613787.3;Name=XM_009613787.3;gbkey=mRNA;gene=LOC104105226;model_evidence=Supporting evidence includes similarity to: 8 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 75 samples with support for all annotated introns;product=heat shock factor protein HSF30%2C transcript variant X1;transcript_id=XM_009613787.3
I am using the following command to extract the ID and product values, but all I get is .mrna1,
awk -F'\t' '$3=="mRNA"' GCF_000390325.2_Ntom_v01_genomic.gff | perl -F'\t' -lane 'if($F[2] eq "mRNA"){/ID=([^\;]+).*product="([^"]+)/; print "$1.mrna1,$2"}' > GCF_000390325.2_Ntom_v01_genomic.gff.csv
The result I want is:
rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
...
What am I missing?
Thanks in advance.
Answer 1
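Whenever you use the capture variables $1, $2, etc., you should first check that the match actually succeeded.
In this case $1 and $2 are empty, and because warnings are not enabled, you are not told about it.
Your regular expression requires a double quote after product=, but the data contains no quotes. I recommend invoking Perl with the -w option.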
perl -w -F'\t' -lane 'if(($F[2] eq "mRNA")&&/ID=([^\;]+).*product=([^;]+)/){print "$1.mrna1,$2"}'
Answer 2
There is no need for a pipe or for perl. All of this can be done with awk.
$ awk -F'[\t;]' '{for(i=11; i < NF;i++) if($i ~ /^product=/) { sub(/ID=/,"",$9); sub(/^product=/, "", $i); print $9","$i }}' infile
rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
rna-XM_033657711.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2
rna-XM_009610342.3,TATA-box-binding protein-like
rna-XM_033658453.1,pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic
rna-XM_009612846.3,uncharacterized LOC104104451
rna-XM_009613787.3,heat shock factor protein HSF30%2C transcript variant X1
$
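The command above assumes that infile already contains only mRNA rows and that product= never appears before field 11. As a hedged variant, the sketch below does the mRNA filtering inside awk and looks up ID and product by key instead of by position. The helper name id_product_csv and the file sample_awk.gff are made up for illustration.

```shell
# Hypothetical sample: a gene row, an mRNA row, and an exon row that also has product=.
printf 'NW_1\tGnomon\tgene\t1\t9\t.\t+\t.\tID=gene-A;Name=A\n' > sample_awk.gff
printf 'NW_1\tGnomon\tmRNA\t1\t9\t.\t+\t.\tID=rna-XM_1.1;Parent=gene-A;product=uncharacterized A;transcript_id=XM_1.1\n' >> sample_awk.gff
printf 'NW_1\tGnomon\texon\t1\t9\t.\t+\t.\tID=exon-XM_1.1-1;Parent=rna-XM_1.1;product=uncharacterized A\n' >> sample_awk.gff

# Hypothetical helper: print "ID,product" for every mRNA row of a GFF3 file.
id_product_csv() {
    awk -F'\t' '
        $3 == "mRNA" {
            id = ""; product = ""
            n = split($9, attr, ";")               # column 9 holds key=value pairs
            for (i = 1; i <= n; i++) {
                if (attr[i] ~ /^ID=/)      id = substr(attr[i], 4)
                if (attr[i] ~ /^product=/) product = substr(attr[i], 9)
            }
            # Skip rows where either key is missing instead of printing blanks.
            if (id != "" && product != "") print id "," product
        }
    ' "$@"
}

id_product_csv sample_awk.gff
# prints: rna-XM_1.1,uncharacterized A
```

Because the lookup is by key, the order of the attributes in column 9 does not matter, and exon or CDS rows that carry their own product= are ignored.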