Extracting strings with a regex produces empty strings

I have the following file:

awk -F'\t' '$3=="mRNA"'  GCF_000390325.2_Ntom_v01_genomic.gff | head
NW_008828495.1  Gnomon  mRNA    35293   38211   .   +   .   ID=rna-XM_009608413.3;Parent=gene-LOC104084433;Dbxref=GeneID:104084433,Genbank:XM_009608413.3;Name=XM_009608413.3;gbkey=mRNA;gene=LOC104084433;model_evidence=Supporting evidence includes similarity to: 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 76 samples with support for all annotated introns;product=cytochrome P450 CYP82D47-like;transcript_id=XM_009608413.3
NW_008828515.1  Gnomon  mRNA    6799    11530   .   +   .   ID=rna-XM_009591409.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009591409.3;Name=XM_009591409.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 22 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X2;transcript_id=XM_009591409.3
NW_008828515.1  Gnomon  mRNA    6799    11530   .   +   .   ID=rna-XM_009630598.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009630598.3;Name=XM_009630598.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 34 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X1;transcript_id=XM_009630598.3
NW_008828528.1  Gnomon  mRNA    2303    14453   .   +   .   ID=rna-XM_033657931.1;Parent=gene-LOC117278374;Dbxref=GeneID:117278374,Genbank:XM_033657931.1;Name=XM_033657931.1;gbkey=mRNA;gene=LOC117278374;model_evidence=Supporting evidence includes similarity to: 72%25 coverage of the annotated genomic feature by RNAseq alignments;product=uncharacterized LOC117278374;transcript_id=XM_033657931.1
NW_008828528.1  Gnomon  mRNA    5510    7652    .   -   .   ID=rna-XM_033657569.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657569.1;Name=XM_033657569.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1;transcript_id=XM_033657569.1
NW_008828528.1  Gnomon  mRNA    5873    8848    .   -   .   ID=rna-XM_033657711.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657711.1;Name=XM_033657711.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2;transcript_id=XM_033657711.1
NW_008828570.1  Gnomon  mRNA    5   6611    .   -   .   ID=rna-XM_009610342.3;Parent=gene-LOC104102329;Dbxref=GeneID:104102329,Genbank:XM_009610342.3;Name=XM_009610342.3;gbkey=mRNA;gene=LOC104102329;model_evidence=Supporting evidence includes similarity to: 27 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 56 samples with support for all annotated introns;partial=true;product=TATA-box-binding protein-like;start_range=.,5;transcript_id=XM_009610342.3
NW_008828592.1  Gnomon  mRNA    9998    13370   .   +   .   ID=rna-XM_033658453.1;Parent=gene-LOC104103684;Dbxref=GeneID:104103684,Genbank:XM_033658453.1;Name=XM_033658453.1;gbkey=mRNA;gene=LOC104103684;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 10 samples with support for all annotated introns;product=pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic;transcript_id=XM_033658453.1
NW_008828592.1  Gnomon  mRNA    13457   18285   .   -   .   ID=rna-XM_009612846.3;Parent=gene-LOC104104451;Dbxref=GeneID:104104451,Genbank:XM_009612846.3;Name=XM_009612846.3;gbkey=mRNA;gene=LOC104104451;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=uncharacterized LOC104104451;transcript_id=XM_009612846.3
NW_008828641.1  Gnomon  mRNA    4417    7406    .   +   .   ID=rna-XM_009613787.3;Parent=gene-LOC104105226;Dbxref=GeneID:104105226,Genbank:XM_009613787.3;Name=XM_009613787.3;gbkey=mRNA;gene=LOC104105226;model_evidence=Supporting evidence includes similarity to: 8 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 75 samples with support for all annotated introns;product=heat shock factor protein HSF30%2C transcript variant X1;transcript_id=XM_009613787.3

I am using the following command to extract the ID and product values, but all I get is ".mrna1,":

awk -F'\t' '$3=="mRNA"'  GCF_000390325.2_Ntom_v01_genomic.gff | perl -F'\t' -lane 'if($F[2] eq "mRNA"){/ID=([^\;]+).*product="([^"]+)/; print "$1.mrna1,$2"}' > GCF_000390325.2_Ntom_v01_genomic.gff.csv

The result I want is:

rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
...

What am I missing?

Thanks in advance.

Answer 1

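Whenever you use the capture variables $1, $2, etc., you must first check that the match actually succeeded, so that they are set.

In this case $1 and $2 are empty, and because warnings are not turned on, you get no notification of that.

Your regular expression requires a double quote after product=, but the data contains no quotes. I recommend invoking Perl with the -w option.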

perl  -w -F'\t' -lane 'if(($F[2] eq "mRNA")&&/ID=([^\;]+).*product=([^;]+)/){print "$1.mrna1,$2"}'

Answer 2

There is no need to pipe into perl. All of this can be done with awk.

$ awk  -F'[\t;]' '{for(i=11; i < NF;i++) if($i ~ /^product=/) { sub(/ID=/,"",$9); sub(/^product=/, "", $i); print $9","$i }}' infile
rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
rna-XM_033657711.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2
rna-XM_009610342.3,TATA-box-binding protein-like
rna-XM_033658453.1,pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic
rna-XM_009612846.3,uncharacterized LOC104104451
rna-XM_009613787.3,heat shock factor protein HSF30%2C transcript variant X1
$ 
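
As a variation on the same idea, the sketch below keeps the $3 == "mRNA" filter inside awk and looks up the ID and product attributes by name instead of by field position. It assumes tab-separated input with the attributes in column 9, as in GFF3; the function name `gff_id_product` is hypothetical and chosen only for this example.

```shell
# Hypothetical helper: reads GFF3 on stdin, prints "ID,product" for each mRNA row.
gff_id_product() {
  awk -F'\t' '
    $3 == "mRNA" {
      id = ""; product = ""
      n = split($9, attr, ";")          # column 9 holds key=value pairs separated by ";"
      for (i = 1; i <= n; i++) {
        if (attr[i] ~ /^ID=/)      id = substr(attr[i], 4)        # drop "ID="
        if (attr[i] ~ /^product=/) product = substr(attr[i], 9)   # drop "product="
      }
      # Skip rows where either attribute is missing instead of printing empty fields.
      if (id != "" && product != "") print id "," product
    }'
}
```

It could be used as gff_id_product < GCF_000390325.2_Ntom_v01_genomic.gff > GCF_000390325.2_Ntom_v01_genomic.gff.csv. Looking attributes up by name means the result does not depend on how many attributes precede product on a given line.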
