awk truncates a string

Running the following command prints the full string, for example Note="Peptidase S59%2C nucleoporin":

awk '$3=="mRNA"'  Nitab-v4.5_gene_models_Chr_Edwards2017.gff | head 
Nt01    maker   mRNA    143295  155540  .   +   .   ID=Nitab4.5_0006317g0010.1;Parent=Nitab4.5_0006317g0010;Name=Nitab4.5_0006317g0010.1;_AED=0.08;_eAED=0.08;_QI=0|0.45|0.25|1|0.90|0.75|12|0|1011;Note="Peptidase S59%2C nucleoporin"
Nt01    maker   mRNA    170633  173860  .   +   .   ID=Nitab4.5_0006317g0020.1;Parent=Nitab4.5_0006317g0020;Name=Nitab4.5_0006317g0020.1;_AED=0.26;_eAED=0.26;_QI=15|0|0|0.83|0.6|0.33|6|0|424;Note="Putative S-adenosyl-L-methionine-dependent methyltransferase"
Nt01    maker   mRNA    156516  160996  .   -   .   ID=Nitab4.5_0006317g0030.1;Parent=Nitab4.5_0006317g0030;Name=Nitab4.5_0006317g0030.1;_AED=0.01;_eAED=0.01;_QI=161|1|1|1|0|0.5|2|358|141;Note="Unknown"
Nt01    maker   mRNA    78554   80638   .   -   .   ID=Nitab4.5_0006317g0040.1;Parent=Nitab4.5_0006317g0040;Name=Nitab4.5_0006317g0040.1;_AED=0.02;_eAED=0.02;_QI=0|0|0|1|1|1|3|0|187;Note="Heavy metal-associated domain%2C HMA"
Nt01    maker   mRNA    111288  129916  .   -   .   ID=Nitab4.5_0006317g0050.1;Parent=Nitab4.5_0006317g0050;Name=Nitab4.5_0006317g0050.1;_AED=0.24;_eAED=0.24;_QI=0|0|0|0.5|1|1|2|0|72;Note="Unknown"
Nt01    maker   mRNA    470560  474346  .   +   .   ID=Nitab4.5_0002367g0010.1;Parent=Nitab4.5_0002367g0010;Name=Nitab4.5_0002367g0010.1;_AED=0.11;_eAED=0.11;_QI=0|0|0|1|1|1|14|0|668;Note="Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation"
Nt01    maker   mRNA    499946  502182  .   +   .   ID=Nitab4.5_0002367g0020.1;Parent=Nitab4.5_0002367g0020;Name=Nitab4.5_0002367g0020.1;_AED=0.26;_eAED=0.26;_QI=0|0.5|0|0.66|0|0|3|0|258;Note="Cellulose synthase"
Nt01    maker   mRNA    496891  497596  .   +   .   ID=Nitab4.5_0002367g0030.1;Parent=Nitab4.5_0002367g0030;Name=Nitab4.5_0002367g0030.1;_AED=0.33;_eAED=0.33;_QI=0|0|0|0.5|0|0.5|2|0|213;Note="Cellulose synthase"
Nt01    maker   mRNA    505125  506853  .   -   .   ID=Nitab4.5_0002367g0040.1;Parent=Nitab4.5_0002367g0040;Name=Nitab4.5_0002367g0040.1;_AED=0.09;_eAED=0.09;_QI=0|0|0|1|0.5|0.66|3|0|230;Note="Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type"
Nt01    maker   mRNA    564383  570328  .   +   .   ID=Nitab4.5_0002367g0050.1;Parent=Nitab4.5_0002367g0050;Name=Nitab4.5_0002367g0050.1;_AED=0.08;_eAED=0.08;_QI=75|1|1|1|1|1|6|146|267;Note="SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12"

However, with the following command the string is cut short, for example to Note="Peptidase:

awk '$3=="mRNA"'  Nitab-v4.5_gene_models_Chr_Edwards2017.gff | awk '{print $9}' | head 
ID=Nitab4.5_0006317g0010.1;Parent=Nitab4.5_0006317g0010;Name=Nitab4.5_0006317g0010.1;_AED=0.08;_eAED=0.08;_QI=0|0.45|0.25|1|0.90|0.75|12|0|1011;Note="Peptidase
ID=Nitab4.5_0006317g0020.1;Parent=Nitab4.5_0006317g0020;Name=Nitab4.5_0006317g0020.1;_AED=0.26;_eAED=0.26;_QI=15|0|0|0.83|0.6|0.33|6|0|424;Note="Putative
ID=Nitab4.5_0006317g0030.1;Parent=Nitab4.5_0006317g0030;Name=Nitab4.5_0006317g0030.1;_AED=0.01;_eAED=0.01;_QI=161|1|1|1|0|0.5|2|358|141;Note="Unknown"
ID=Nitab4.5_0006317g0040.1;Parent=Nitab4.5_0006317g0040;Name=Nitab4.5_0006317g0040.1;_AED=0.02;_eAED=0.02;_QI=0|0|0|1|1|1|3|0|187;Note="Heavy
ID=Nitab4.5_0006317g0050.1;Parent=Nitab4.5_0006317g0050;Name=Nitab4.5_0006317g0050.1;_AED=0.24;_eAED=0.24;_QI=0|0|0|0.5|1|1|2|0|72;Note="Unknown"
ID=Nitab4.5_0002367g0010.1;Parent=Nitab4.5_0002367g0010;Name=Nitab4.5_0002367g0010.1;_AED=0.11;_eAED=0.11;_QI=0|0|0|1|1|1|14|0|668;Note="Auxin
ID=Nitab4.5_0002367g0020.1;Parent=Nitab4.5_0002367g0020;Name=Nitab4.5_0002367g0020.1;_AED=0.26;_eAED=0.26;_QI=0|0.5|0|0.66|0|0|3|0|258;Note="Cellulose
ID=Nitab4.5_0002367g0030.1;Parent=Nitab4.5_0002367g0030;Name=Nitab4.5_0002367g0030.1;_AED=0.33;_eAED=0.33;_QI=0|0|0|0.5|0|0.5|2|0|213;Note="Cellulose
ID=Nitab4.5_0002367g0040.1;Parent=Nitab4.5_0002367g0040;Name=Nitab4.5_0002367g0040.1;_AED=0.09;_eAED=0.09;_QI=0|0|0|1|0.5|0.66|3|0|230;Note="Zinc
ID=Nitab4.5_0002367g0050.1;Parent=Nitab4.5_0002367g0050;Name=Nitab4.5_0002367g0050.1;_AED=0.08;_eAED=0.08;_QI=75|1|1|1|1|1|6|146|267;Note="SAC3/GANP/Nin1/mts3/eIF-3

The final result I want is Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin.

What am I missing?

Thanks in advance.

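Answer 1

GFF is a tab-separated format, but you are not splitting on tabs. Unless you use -F'\t' or BEGIN{FS="\t"}, awk treats any run of whitespace, including spaces, as the field separator. Because the line is being split on spaces, $9 ends at the first space. You also don't need two commands. Here is what you should do: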

$ awk -F'\t' '$3=="mRNA"{print $9}' file.gff 
ID=Nitab4.5_0006317g0010.1;Parent=Nitab4.5_0006317g0010;Name=Nitab4.5_0006317g0010.1;_AED=0.08;_eAED=0.08;_QI=0|0.45|0.25|1|0.90|0.75|12|0|1011;Note="Peptidase S59%2C nucleoporin"
ID=Nitab4.5_0006317g0020.1;Parent=Nitab4.5_0006317g0020;Name=Nitab4.5_0006317g0020.1;_AED=0.26;_eAED=0.26;_QI=15|0|0|0.83|0.6|0.33|6|0|424;Note="Putative S-adenosyl-L-methionine-dependent methyltransferase"
ID=Nitab4.5_0006317g0030.1;Parent=Nitab4.5_0006317g0030;Name=Nitab4.5_0006317g0030.1;_AED=0.01;_eAED=0.01;_QI=161|1|1|1|0|0.5|2|358|141;Note="Unknown"
ID=Nitab4.5_0006317g0040.1;Parent=Nitab4.5_0006317g0040;Name=Nitab4.5_0006317g0040.1;_AED=0.02;_eAED=0.02;_QI=0|0|0|1|1|1|3|0|187;Note="Heavy metal-associated domain%2C HMA"
ID=Nitab4.5_0006317g0050.1;Parent=Nitab4.5_0006317g0050;Name=Nitab4.5_0006317g0050.1;_AED=0.24;_eAED=0.24;_QI=0|0|0|0.5|1|1|2|0|72;Note="Unknown"
ID=Nitab4.5_0002367g0010.1;Parent=Nitab4.5_0002367g0010;Name=Nitab4.5_0002367g0010.1;_AED=0.11;_eAED=0.11;_QI=0|0|0|1|1|1|14|0|668;Note="Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation"
ID=Nitab4.5_0002367g0020.1;Parent=Nitab4.5_0002367g0020;Name=Nitab4.5_0002367g0020.1;_AED=0.26;_eAED=0.26;_QI=0|0.5|0|0.66|0|0|3|0|258;Note="Cellulose synthase"
ID=Nitab4.5_0002367g0030.1;Parent=Nitab4.5_0002367g0030;Name=Nitab4.5_0002367g0030.1;_AED=0.33;_eAED=0.33;_QI=0|0|0|0.5|0|0.5|2|0|213;Note="Cellulose synthase"
ID=Nitab4.5_0002367g0040.1;Parent=Nitab4.5_0002367g0040;Name=Nitab4.5_0002367g0040.1;_AED=0.09;_eAED=0.09;_QI=0|0|0|1|0.5|0.66|3|0|230;Note="Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type"
ID=Nitab4.5_0002367g0050.1;Parent=Nitab4.5_0002367g0050;Name=Nitab4.5_0002367g0050.1;_AED=0.08;_eAED=0.08;_QI=75|1|1|1|1|1|6|146|267;Note="SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12"
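
To see the difference in field splitting directly, it can help to print NF next to $9 under both settings. The following is only a minimal sketch: the line is a shortened, made-up stand-in for a real GFF record (the ID value `x` is hypothetical), built with printf so the tabs are explicit.

```shell
# One GFF-like line; %%2C in the printf format produces a literal %2C.
line=$(printf 'Nt01\tmaker\tmRNA\t143295\t155540\t.\t+\t.\tID=x;Note="Peptidase S59%%2C nucleoporin"')

# Default splitting: any run of spaces or tabs separates fields,
# so the spaces inside the Note text create extra fields.
default_out=$(printf '%s\n' "$line" | awk '{print NF ": " $9}')

# Tab-only splitting: the attributes column stays in one piece.
tab_out=$(printf '%s\n' "$line" | awk -F'\t' '{print NF ": " $9}')

printf '%s\n%s\n' "$default_out" "$tab_out"
```

With the default separator the record has 11 fields and $9 stops at `Note="Peptidase`; with -F'\t' it has the 9 columns GFF defines.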

To get only the value of Note=, you can do:

$ awk -F"\t" '$3=="mRNA"{sub(/.*Note=/,"",$9); print $9}' file.gff 
"Peptidase S59%2C nucleoporin"
"Putative S-adenosyl-L-methionine-dependent methyltransferase"
"Unknown"
"Heavy metal-associated domain%2C HMA"
"Unknown"
"Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation"
"Cellulose synthase"
"Cellulose synthase"
"Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type"
"SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12"

And to remove the quotes at the start and end of the note, while keeping any quotes that may be part of the note itself, do:

$ awk -F"\t" '$3=="mRNA"{sub(/.*Note=/,"",$9); print $9}' file.gff | sed 's/^"//; s/"$//'
Peptidase S59%2C nucleoporin
Putative S-adenosyl-L-methionine-dependent methyltransferase
Unknown
Heavy metal-associated domain%2C HMA
Unknown
Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Cellulose synthase
Cellulose synthase
Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12

Finally, to get both the ID and the Note values, you can do:

$ awk -F"\t" '$3=="mRNA"{n=$9; sub(/.*Note="/,"",n); sub(/"$/,"",n); sub(/.*ID=/,"",$9); sub(/;.*/,"",$9); print $9","n}' file.gff 
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12

Personally, though, I would do this in Perl:

$ perl -F'\t' -lane 'if($F[2] eq "mRNA"){/ID=([^\;]+).*Note="([^"]+)/; print "$1,$2"}' file.gff 
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12

Answer 2

Assuming the quoted string at the end of each line cannot contain a tab or a ;, then something like this is probably what you want:

$ awk -F'\t' -v OFS=',' '$3=="mRNA"{ gsub(/;/,FS); gsub(/[^\t=]+=|"/,""); print $9, $15 }' file
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12

Or (the original idea):

$ cat tst.awk
BEGIN { FS="\t"; OFS="," }
$3 == "mRNA" {
    split($NF,f,/;/)
    for (i in f) {
        gsub(/^[^=]+="?|"$/,"",f[i])
    }
    print f[1], f[7]
}

$ awk -f tst.awk file
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12

If the quoted string can contain a , then you should reconsider using , as the OFS and use a tab or ; instead.
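
As a rough illustration of that caveat, here is a minimal sketch that writes tab-separated output instead. The input line is hypothetical (the ID `g1.1` and a Note containing a literal comma are made up for the example), and the extraction uses the sub() approach from Answer 1 rather than the scripts above.

```shell
# Hypothetical sample line whose Note text contains a literal comma.
line=$(printf 'Nt01\tmaker\tmRNA\t1\t2\t.\t+\t.\tID=g1.1;Parent=g1;Note="Zinc finger, RING-type"')

out=$(printf '%s\n' "$line" | awk -F'\t' -v OFS='\t' '$3=="mRNA"{
    n = $9
    sub(/.*Note="/, "", n); sub(/"$/, "", n)   # keep only the Note text
    id = $9
    sub(/.*ID=/, "", id); sub(/;.*/, "", id)    # keep only the ID value
    print id, n                                 # joined with a tab, not a comma
}')
printf '%s\n' "$out"
```

Because the two output columns are joined with a tab, a comma inside the Note cannot be mistaken for a column boundary.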

Answer 3

Assuming your "final result" output format is derived from the ID= and Note= fields, the following will work:

$ awk -F'\t' '$3=="mRNA"{split($9,f,";"); split(f[1],i,"="); split(f[7],n,"\""); printf("%s,%s\n",i[2],n[2])}' file.gff
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12

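This splits the tab-separated 9th field into ;-separated subfields, which are stored in the array f. The first subfield (ID=...) is then split at = and the 7th subfield (Note=" ... ") is split at " to isolate the parts of interest, which end up in the auxiliary arrays i and n respectively. The relevant elements of those arrays are then printed.

This allows for fairly compact code, but it is not necessarily the most efficient approach.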
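
The lookups f[1] and f[7] rely on the attributes always appearing in the same order. If that cannot be assumed, one possible variation is to pick subfields by their key instead of their position. This is only a sketch: the sample line is hypothetical and deliberately has Note as the third subfield rather than the seventh.

```shell
# Hypothetical line with fewer attributes than the real file.
line=$(printf 'Nt01\tmaker\tmRNA\t1\t2\t.\t+\t.\tID=g1.1;Parent=g1;Note="Cellulose synthase"')

out=$(printf '%s\n' "$line" | awk -F'\t' '$3=="mRNA"{
    id = ""; note = ""
    n = split($9, f, ";")
    for (k = 1; k <= n; k++) {
        if (f[k] ~ /^ID=/)   { id = substr(f[k], 4) }     # text after "ID="
        if (f[k] ~ /^Note=/) {
            note = substr(f[k], 6)                        # text after "Note="
            gsub(/^"|"$/, "", note)                       # strip the outer quotes
        }
    }
    print id "," note
}')
printf '%s\n' "$out"
```

This costs a loop per record, so it trades some of the compactness above for not depending on attribute order.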

Answer 4

Using GNU awk:

awk -F'\t' -vOFS=, 'NF>=9&&$3=="mRNA"{
N = split($9, a, /[=;]/)
for ( i=1; i<=N; i+=2 )
    h[a[i]] = a[i+1]
print h["ID"], gensub(/"(.*)"/, "\\1", "g", h["Note"])
}' bioinformatic.file

Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
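
One thing to watch with a key/value array like h: it is never emptied between records, so if some mRNA line had no Note attribute, the value from an earlier line would be printed again. Whether that happens in this particular file is an assumption. A minimal sketch of a variant that clears the array first, using two made-up lines and a portable gsub() in place of the GNU-only gensub():

```shell
# Two hypothetical mRNA lines; the second has no Note attribute.
data=$(printf 'c\tm\tmRNA\t1\t2\t.\t+\t.\tID=a.1;Note="Unknown"\nc\tm\tmRNA\t3\t4\t.\t+\t.\tID=b.1')

out=$(printf '%s\n' "$data" | awk -F'\t' -v OFS=, 'NF>=9 && $3=="mRNA"{
    split("", h)                    # empty the array left over from the previous record
    N = split($9, a, /[=;]/)
    for (i = 1; i <= N; i += 2)
        h[a[i]] = a[i+1]            # attribute name -> attribute value
    note = h["Note"]
    gsub(/^"|"$/, "", note)         # portable stand-in for the gensub() call
    print h["ID"], note
}')
printf '%s\n' "$out"
```

With the array cleared, the second line prints an empty Note column instead of repeating "Unknown".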
