Replacing empty fields with a character in a tab-delimited file
I want to replace the empty fields between tabs with the character "|". The file can be downloaded here:
wget http://download.cbioportal.org/cancerhotspots/cancerhotspots.v2.maf.gz
gunzip cancerhotspots.v2.maf.gz
cat cancerhotspots.v2.maf | grep -v version | head -3
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POMOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel 000236 NORMAL C C c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att 1 -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 MODERATE 1 SNV ACC . . acyc exome RETAIN V334I 334 V I NA WARS2 334 360 ACC headandneck saca acyc 000236
OPN3 23596 . GRCh37 1 241761094 241761094 + Missense_Mutation SNP G G A rs780348058 000236 NORMAL G G c.899C>T p.Ser300Leu p.S300L ENST00000366554 3/4 0 . . 0 . . OPN3,missense_variant,p.Ser300Leu,ENST00000366554,NM_014322.2;OPN3,missense_variant,p.Ser221Leu,ENST00000331838,;KMO,downstream_gene_variant,,ENST00000366559,NM_003679.4;KMO,downstream_gene_variant,,ENST00000366557,;KMO,downstream_gene_variant,,ENST00000366555,;OPN3,non_coding_transcript_exon_variant,,ENST00000469376,;OPN3,non_coding_transcript_exon_variant,,ENST00000490673,;OPN3,non_coding_transcript_exon_variant,,ENST00000478849,;OPN3,non_coding_transcript_exon_variant,,ENST00000463155,;OPN3,non_coding_transcript_exon_variant,,ENST00000462265,; A ENSG00000054277 ENST00000366554 Transcript missense_variant 1006/2620 899/1209 300/402 S/L tCg/tTg rs780348058 1 -1 OPN3 HGNC 14007 protein_coding YES CCDS31072.1 ENSP00000355512 Q9H1Y3 UPI000000165B NM_014322.2 deleterious(0.02) possibly_damaging(0.692) 3/4 Transmembrane_helices:TMhelix,PROSITE_profiles:PS50262,hmmpanther:PTHR24240:SF64,hmmpanther:PTHR24240,PROSITE_patterns:PS00238,Gene3D:1.20.1070.10,Pfam_domain:PF00001,Superfamily_domains:SSF81321,Prints_domain:PR00237 MODERATE 1 SNV 9.415e-06 0 0 0.0001278 0 0 0 0 . CGA . . 9.426e-06 1/106086 1/106208 0/9066 0/11158 1/7822 0/6612 0/54326 0/694 0/16408 PASS acyc exome RETAIN S300L 300 NA OPN3 300 402 TCG headandneck saca acyc 000236
When a column has no value, there is nothing between the two tabs. This shows up when counting the columns:
cat cancerhotspots.v2.maf | grep -v version | head -4 | awk '{ print NF }'
148
80
99
81
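The counts differ because awk's default field splitting treats any run of blanks and tabs as a single separator, so consecutive tabs around an empty column collapse into one. A minimal sketch on a made-up three-column row (the row contents are hypothetical, not taken from the MAF file) compares the default splitting with splitting on single tabs:

```shell
# Hypothetical row with three tab-separated columns; the middle one is empty.
row=$(printf 'd\t\tf')

# Default splitting: a run of whitespace counts as one separator.
count_default=$(printf '%s\n' "$row" | awk '{ print NF }')

# Tab as the field separator: every single tab separates two fields.
count_tabs=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF }')

echo "default: $count_default, tab-separated: $count_tabs"
```

With -F'\t' the empty column is counted as a field, so the same option can be used to count the real columns of the file.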
Desired output: when a column has no value, the empty field is replaced with the character "|".
cat cancerhotspots.v2.maf | grep -v version | head -2 | sed 's/\t\t/\t|\t/g'
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POMOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel | 000236 NORMAL C C | | | | | | | | c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att | 1 | -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 | Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 | | | | | | | | MODERATE 1 SNV | | | | ACC . . | | | | | | | | | | acyc exome RETAIN V334I 334 | V I NA WARS2 334 360 ACC headandneck saca acyc 000236
cat cancerhotspots.v2.maf | grep -v version | head -4 | sed 's/\t\t/\t|\t/g' | awk '{ print NF }'
148
118
128
118
Every row of the output should have 148 columns, but only the header has 148 columns.
How can I fill every empty column uniformly with "|"?
Thanks!
Answer 1
What you want is:
awk 'BEGIN{FS=OFS="\t"} {for (i=1;i<=NF;i++) if ($i == "") $i="|"; print}' file
Or:
sed 's/\t\t/\t|\t/g; s/\t\t/\t|\t/g' file
However, it is hard to tell from the example provided.
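One way to check the result against the requirement in the question (the same column count in every row) is to count fields after the replacement and collapse duplicate counts. The sketch below uses a small made-up file with 5 columns, not the real MAF file; the variable names sample and field_counts are placeholders chosen for illustration.

```shell
# Hypothetical sample: 5 tab-separated columns, some of them empty.
sample=$(mktemp)
printf 'h1\th2\th3\th4\th5\n'  > "$sample"
printf 'a\t\t\t\tb\n'         >> "$sample"
printf 'c\t\td\t\te\n'        >> "$sample"

# Fill the empty fields, count the fields per row the same way the
# question does, and keep only the distinct counts.
# A single number in the result means every row has the same column count.
field_counts=$(awk 'BEGIN{FS=OFS="\t"} {for (i=1;i<=NF;i++) if ($i == "") $i="|"; print}' "$sample" |
  awk '{ print NF }' | sort -u)

echo "$field_counts"
rm -f "$sample"
```

On the real file the same pipeline should print the header's column count alone if no row is missing tabs.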
Using commas instead of tabs for visibility, this shows why the sed version needs the substitution twice:
$ printf 'a,,,,b\n' | sed 's/,,/,|,/g'
a,|,,|,b
$ printf 'a,,,,b\n' | sed 's/,,/,|,/g; s/,,/,|,/g'
a,|,|,|,b
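The regular expression ",," consumes both commas of each match, and the scan resumes after the match, so the second comma of one match cannot also be the first comma of the next. In a run of commas the first pass therefore matches the odd-numbered pairs of ",,", and the even-numbered pairs are not matched until the second pass. Another example: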
$ printf '12345678\n' | sed 's/\([0-9]\)\([0-9]\)/\1|\2/g'
1|23|45|67|8
$ printf '12345678\n' | sed 's/\([0-9]\)\([0-9]\)/\1|\2/g; s/\([0-9]\)\([0-9]\)/\1|\2/g'
1|2|3|4|5|6|7|8
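Both sed substitutions only see an empty field that sits between two separators. If the first or last column of a row can be empty (an assumption; the excerpt in the question does not show such a row), two anchored substitutions would be needed in addition, whereas the awk version already covers that case because it tests every field. A sketch with commas standing in for tabs, on a made-up row:

```shell
# Commas stand in for tabs, as in the examples above.
# Hypothetical row whose first and last fields are empty as well.
filled=$(printf ',a,,,b,\n' | sed 's/,,/,|,/g; s/,,/,|,/g; s/^,/|,/; s/,$/,|/')

echo "$filled"
```

The third expression fills an empty first field and the fourth fills an empty last field; the row keeps its six fields.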