Split a file into separate files based on a keyword in a specific column of each header line

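I want to create separate files based on the first chromosome mentioned in each header line. There are 24 chromosomes mentioned in the header lines, and the sequences for each header are given in the two lines that follow it. The file structure is:
header
human sequence
other genome sequence

However, all chromosome sequences are combined in a single file, and I want to split it into one file per chromosome, each holding the matching headers and their sequence pairs. I wrote a Python script for this, but uploading the large merged file to the cluster takes a long time and often fails with connection errors. So I would like to use a Bash script instead.

The idea is to search the second column of the header line for "chrY" (or whatever the chromosome name is) and then write that header line and the 2 sequence lines that follow it to a separate file.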

2057524 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA

Details:

2057524 chrY (human chromosome) 68 170 chrX (other genome chromosome) 23685 23787 - 4125 -> header line
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA (human sequence)
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA (other genome sequence)

For testing purposes:

2057521 chr10 57211219 57211230 NW_007726181v1 1018288 1018299 + 575
CTGGGCACTATG
CTGAGCGCGGTG

2057522 chr2 57211231 57214400 NW_007726181v1 1018406 1021615 + 116172
GTTTtgagcttgt----acccagcgctgcttttgccttgctctgtgaccccaggcaagctgcctcacctctctgggccagtttccccat-cgtacagtggTGCTGCACACCCTGGCCCTGGCCC-CGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTTTCTCATCAGTGCCCGGTGCTGGGT-CAGGGATCGACTGAGGCTCT--GAGCTAACTAGGAAACACAGTGGCCTTG--GAGGGCTGGGGAGTGTCATGGGGGTG---GGGACAGGGAGCCACCGGTCGCATGTGACTGAACTCTT-----------------CACCCCAGTCTGTGGCTTTCCCGTTGCAGTGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGGGTAGGAGGCAGGGCCGCTGCCCGCCCTGGGTCGGCACCT---------------TGTAATTCTGTCCTGCCTTTTTCTTCCTGTATTTAAGTCTCCGGGGGCTGGGGGAATCAGGGTTTCCCACCAACCACCCTCACTCAGCCTTTTCCC-TCCAGGCATCTCTGGGAAAGGACCT------GGGGCTGGTGAGGGGCCCGGAGGAGCCTTTGCCCGCGTGTCAGACTCCATCCCTCCTGTGCCCCCACCGCAACAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCCTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCTGTTGCTCTGACATGGACACAGCCAGGACAAGCTGCTCAGACCTGCTTCCCTGG-GAGGGGGTGACGGAACCAGCACTG---------TGTG-GAGACCAGCTTCAAGGAGCGGAAGGCTGGCTTGAGGCCACACAGCTGGGGCG---GGGACTT-CTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACC--------CAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCAGAGAAAACGGCA-CACCAATCAATA-----------AAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGACCTCTAGCTGTGCAGGATGCAAACGTCTCGGGGTCAGTGACTGCCTCCTGCCCCTGTTGGTCCCTAGGCAGTGGGGGCAGAAGCTCCCAGCTGACCTG------TTTCTCTGGGAGAGAAGGGCAGTCAGCAGGGGCAGCTGTTGCAGATGGGAGGAATAG--------TCTCCCACA----AAAAAGGTTTCAGTGACAGACACGGGGTCTCTAAAAATAGTCATGCTGAGAGCCCAATGGCCCTTGGCACAATTGCTGGTGTTGGGGTAGAAGATGTCTTGGAGTTTGCTCAAGTGGTTGAGAGGGAGGGAGGTGCCATCAACTT---GGAGGAACTGGCACCAAGCCAGGGAGATAGAAATCCAGGCAAGGCTGTGGGGCAGGTTAGGGAGCAAGGCTGCAGGGGTGACTCAGGAAGAAGGTGGGGGAGGTGACAAGCCCCCAGGCAGGGGCCCTGTGGCC-------------ATGGGGATCTTTTTAAATTGAGACTAGGGGGTGAATAGTCCAGGGCAGCTAACTTTAGTTATTATAGAAAG-GGCAGTAGCAGATGGGTCTG-CTCCGTCTCGCTTCTAAGAAGGTGG---------------GCAGGACAAATGGCAGCCTCCTGCAGAGGCCCAGTGAGAAGCCTGGCCC-------TCGGCCAC-----ACAGGATGGAAGACAGATTGGATTCCACAGAGGGGAGCTGCCCTGGGAAGATCTCACGGATGGCCAGGACCCACCATTTCTTCGGGGTTCCCCT-GTTTTCTCCAACGGGCACTAATGCCTGTGCCTGGGTCCTGGCAACAC----------------------TCTGGACTCCACACTCT--TCTGGGTTTCACCTTTGTA-GCAGGATCCCTGCAGATCAGGCCCATGACAAACACCGTCTCCAGCGGGC
AGAGCAAAGGAAGGGCGCAGCGCCAGGCAGTGGTGCAGCTGCCTGTCAGGAAGAGGCCTACTTCT---GGTGAAACTGGGCAGAC---AAAAGGCAGTGAGAAATGTGATCTCGGGGTGGTGGAGGCTC-TAGGGAAAGGAAAAGGCAGGAGTGAACTTCCACACAGCAGCAATGGCAGAACCAAAGGTGGCTTTGACCTCCACGAGGGCTCAGATCCAGGCCAACAGCTTGTCCAGGACAGGGTGCCGGGTGTATCACTAATCCAGGAGCACTATGCTGGCAGAATCCCTTTGGTGCCTGATGGCCCTGCCTTCGTGGGAACAGAGGCTAAGGCTTTGAGTTACAGCTGCCTCCCCAACAGTGCATCCCCTTCTCCTTCCTCAGCCTCAGGTAGGAGACAGGGCAGGCAACCCCCCTTTCCTCTTCTCCCCTTCTCCAGCCCCTGTCTGTCCACCCAGCTGGAGGCAG--CCAGGCTTGCCTATGGACTGGTTGACAGCCTTCATGCACAGGTTCTCCACCAGAGCCTTTCTTGGGGGCCCCTGGCT--GGGCTCTGAGCTGGGAGTGAAGGGGATGACCCATGCGGACTGTTTGCTGC-------------TTGTAGCTTTCCCTGGGA-AAGACTCTGCCAGGCCTTGGAGCCAGACCAGGAGGCTTTATAGGCCACTGCAAGCAGCAGGGCTCCAGATGACATCACAGGGAATATCAAGAGGGTGTGGAGGGGCATCGAAGCCTCTCCAGGAG---ACAG----GAGAC---GCCGGCCCAGTAGAGCCCTAGGGGCGACGCCACTCCCACTCACTGTCTACTCTCCTCTCACCTCTGCAACACTGGGGACACTCACAAGATTGTGATCCAAGTCGGCCGTCGTCTTCTGCAGCTCTGGAGACCTGATGCTGGGGAAGGGCATGCCTGGCATCACCACACACCTGGGAGGAGACAGGAGCCTG-GGGCCGGTGG---------------------GCCCACACATCACCAGCTGCTCCGTTCTACCATTTCTTCAGCCCTCTTGGCTGTGC-CTGCGGCTCTGCCCCTCCCCTCTCTGCACCTACCACCCAGAGAGGGCTTGTTGAGCTCAGAGATCCCACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGATGGCCTGCCTGATCTTCCACCTGCTCTCCCAGGGCCAAAGCCAGACCTGCTGAGCCCCTCCC--TCCAGCCGGCTGGT-CTGAGCAGTCACAGCCCGGCTTTGGGCTCCGATGGCAGCAGATGGCAGGTAGGGGTCCAGCTGCTGG-AGCGAGGGCCGGCCACGTATCACAG-CCAAGGAGATGAGCACAAG--CACTACTTACTGGCCTAGGTTGTCAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAATC
gtcctgagtttgccaaggcccagctctgcttctgacttgtcctgtg-----agacaaagtgcctaacgtctttgggccagtttcctcatccccacagtggggctgcaca-cctgcccgtgtcttacaggatggccgtgatgt------tca-----CCATTTTCTTAT-AGTACCCACTGCCAGATACACGGACAGACCAAAGCTCCCAGAGCTCA-TGGTGAACAT-GTGGCTGTGGAGAGGGCTGGGGACTGTTGCAGGGGCAAGTGAGCCAAGAGGGCACTGG-CGACTGGGCCTGGAGCCTCCCGACTTGGCCCCCGCACCCTCCCACCTGCAGCTTTCCTCTTGCAGTGAGAGCCACAAGCCAAGGTGGAGACCTGATGTCAGATCTCTTCAACAAGCTGGTCATGAGGCGCAAAGGTAGGAGGCAGAGGGGCCGCC--TTCAGGGCAGGGGCCTCAGGGTGTCCTGCAGTGTACTTCTGTTCTGCCTCTTTCTTCCTGTTTTTAAATCTTCAGGGGCTGTGTGACCTGGGGCCTCCATACACCCTCCCTCA--CAGCCTTTTCCCTTCCAGGTATCTCCGGGAAAGGACCTGGAACAGGGGCCAGCGAGGGGCCAGGAGGAGCCTTCGCCCGAATGTCAGACTCCATCCCGCCTCTGCCTCCCCCACAGCAGCCAC---CGGGAGAGGACGAGGATGACTGGGAATCCTAGGGGTCT-CAGCACTCCTTCCTCCCCCAACCCAGACTTGGGCTGTGGCCCTGAGACAGACACAGCTGGGACA--------------GCCCCCTTGGTGAGACAGGGATGGTG-CAGGACTGCCCTACGTCTGTGCTGGGCCTTCTTCAGGGAGCGGGTAGGTTGCATGAAACCATAAGTGTGGGGTGGGAGGGGCTCGCTCTCCACCTGTGCCCCACCGTGTGCCTGCTCTACCCACCCCTTCAGCGTGTGCTCCTCTTCCCGAAAGAGACT--CGAAGAAAACAGCACCATGAATCAATAAAGGACGATGTAAGAACTGAGCATAAACCAACAGTGCACTTTTAATTAAGGAGTCAAGGCTGGGTGGCTTGCAAACATCTGAGAACCAGTGACTG--TCCTGCCCC-GTGGGTCTCCAGGCAAT-GGGGCAGAACATCTGAGTGGACCAGGGCCCCTTGCACTGGCTCGAAGGTGCAGTCAGCAGGGGCAGCTGCTGTGGATGGGAGGGAGGGAGGGAGATGTTCCCACGGGATAAAGATGTCTCAGTGACAGACATGGGGTCTCTAAAAATAGTTGTGCTGAGAGCCTAATGGCCCTTGGCATAATTGCTGATGTCAGGGTAGAAGGTGTCTTGGAGTTTGCTCAAGTGCCTGAGAGGGAAGGAGGTGCCATCAACTTGGAGGAGGAATGGGAGCCAAGCCAGAGAGA-AAAGCCCTGCGCGGAGCTGTGGAGCAGACCA--GAGCACAGCTG-----------------------------------AGGCTGGCAGG-AGGAGCCGTGTGGACAGCAGAACTAGAAATGGGGAACGTTTTGAGT------------GTGAAATGTCTAGAACAGCTCATTTTAGCTAGGATGAACAGAGGCAG-----GATGGGCCTGTTTCCATCGGACCTCTGAGAAGGTGGCTACTGAGAAAACATGCAGGACAGAAG-----CTGCAGCAGAACACCGGGCAGGAGCCTGGCGCGGCCAGTGTGGCCACACTAAACAGGGAGGAAGATGCAATGG------CAGGGAGCAGCTGCCCTGCAGTGGGCTCAAGGGCAGTCAGGACCCACTGTTTACTCAGGATCAACCTAGTTTTCTCCAACTGGCTTTTCTACCTGGGCCTGCATGCGGGCAGCCCACTGATGCTGGAAGGGGGCTGGTCTGGACCTCACACTCTACACCTGGTTTCACCTTCTTAGGCAGGATCCCTGTAGACCAGGCCCAAGACAAACACCATTCTAAGTGGGC
AGGGTAAAGGAAGAGC------CCGGGC--TGGTGCAGCCATCCATCAGGAACGGCCAAACTTCTCCCGATGAAACTGGGGAGATGGGAAAAGGCAGTGAGAGACTAGATCTCAGGGTGA-GCAGGCTCGGGGGGGAAGGAAAAGGCAGGACTGACCTTACGCATAGCAGCAACAGCATGGCCAAAGGTGGCCTTGACCTCCACACGGTCTCGGATCCAGCCTGGCAGCTTTGCCAGGATGGGTGGGCGGGCATATCGCTGGTCTAGGAGCACTATGCTGGCAAAATCCCTCTGGTGCCTGATGGCTCTGCCTGGATGGGAACAGAATTTGGGGCTCCTAGGTAAA-------------------ATCCTCTCCTGTGACTTCATTCTC-------------------CAACCACCCAT--CTGTACTCC----------CAACTATCCATCCTGACAGCCAGGAGCAGTCCCAGGCTTACCTATAGATTGGTTGACAGCCTTCATACACAGATTCTCCACCAAGGCCTTCCCTGGTGGGGGCTGGCCTGGGGTTCTGGGCTGGGAAGGGTAGAAAGGACCTATCAGAACTGTTCCTTACCTCCTGTCTAGTGTTCTAGCTCTCCCTGGGAGAAGAGCCTGCCAGGCTTTGGAGCAAGACCAGGCAGCTTCACAAGCCAGTGCCAGCAGCTGG------CACGATGTCATGGAGAAGGTCAAGAGGGGGACAGGAAACACC--AGCATGGCAAGGAAGTCACAGCTACAAGACCCTGCTATCTCAG------CCTAGGGAATACACCACACTTCCCCCCGGCC--CTCTCCTCAT-CCTCTGGAATCCTGGAGGTACTCACAAGGGTCTGATCCAAGTAGGTCATCTTCTCTTGTAGTTCTGGAGAGTTGATGTTGGGGTAGGGCATGCCCACCATCACTACACACCTAGGTGGAGATGCACGCCGATGGGCATGTGGCCTCACACTCACTGAGTCCTCACCCACATGCCACCGACTGCT--GCTCTACCTCTGCTGCCG--CTCTTGGCTATGCTCGGCAGCTCTACCCTCCGCATC-CCGTACCTACCACCTGGAAAGGATTTTTTCAGCTAAGAGACCCAGTCTAAGCCAATGAACATAGTCCTGATAAGGTTATGGTTTGCCCCATTTTCCATCTGCTCT-CAAAGGCCCAATCCAGAGTTGCTGAAACACTTCCCGCCTGGCTGCCTGATCCTGAGCAGC--CAGCCTGGGTGCAGACTCAGATGGCATCAGATTGCAGGT-GGGGCCCAGCTGCTGGAAGTGAGGAGTAGCCAGGTGTCATAGCCCCAGGAGAGAAGGAGAGGACCACTACTTACCGGCCTAGGTTGTCAGAGAAGTTAATCCCTTCACTCATCTTTCCTCCAACC

2057523 chrY 57214466 57215088 NW_007726181v1 1023265 1023919 + 29358
GGCCCATCCCACTCTAGGCATGGCTCCTCTCCACAGGAAAACTCCACTCCAGTGCTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCT---GGAAGTCAGACACCTGCAGATCAAGACCACAGCATCAAGACCCTGTGACCTCTCAAAGGCCTGGTGGAAAGGA--------------CACGG-----------GAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAA--GGCTGACGGCAAGTTAACAAAAAGAAA--AATGGTGAATGATACCCGGTGCTGGCAATCTCGTTTAAACTACATGCAGGAACAGCAAAGGAAATCCGGCAAATTT-GCGcagtcattctcaacaccggccatgcagcaaaatcatcagtggaaatttaaaaaaatacacgtggccaggccccagcccaaatcact-aataagaatctccaggg-CTtcacctgttagactggcaaaaaatccaaaag--taaacactttgtggagaaacaggcactcctagacattgctggtgggatacagaacagtacaattctga------------tggtaatcagttaacaaattaaacatatttattttatacttttaaacccaggaatcccatatttaggagtctactgagaccaaacagc
GGCTCGCTCCGCCCGGGTCACA-CTCCTCACCGCAGGAGAACTCCACCAC-TCGCTCAGCCTCAGCCCCAGCGCACGCCAGCAGCTGCTCCCGGAAGTCAGACACCTGCAGA------CCACAATGGCAGGGCCCTGTGACCTCCCAGAGGCACAGGGGAGAAGAACCTCAGGCCTCGGCATGGAGGGCAAGACAGAAGTCTGGGCTGGAAGGCAGCAAGTACGTACAAACAGAAAAAAGAGCTAAAAAAAAAAAGGCTAACAACAAATTAACAATAATAAATAAATTGTTAATGATATCCAGTGTTGGCAGTTTCATTTAAGCTACTGGTAGAAACAGCAAAGGAAATCTGGCAAACTTGGCAcagtgattctcaaccctggctatgcatcaaaaccagcagtgggaatttaaaaaaatacACATGGCCCAGCCACAGTCCAAACTACTGAATAACAATCTCCAGGGttttcacctaccaaattggc---aaatccgaaagtttaaccactctgtggagaaaaaggcatttttaaacattgctggtgcaatacagaatagtacaactcttacataggggaatttgacaat-acttaacaaattaaatgga-----tttttactttttaactcaggaatctcatatctgggactccacccagaatacacagc

2057524 chrX 68 170 NW_007727164v1 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA

Answer 1

I suggest using grep with the -A switch, which tells it to also print the lines that follow a match, like this:

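#!/bin/bash
# Usage: ./extract.sh myfile
file=$1

# 24 chromosomes in total means chr1..chr22 plus chrX and chrY.
for i in $(seq 1 22); do
  # The trailing space keeps "chr1 " from matching "chr10".
  grep -A2 "chr$i " "$file" > "seq_$i"
done

# Sex chromosomes, each written to its own file.
grep -A2 "chrX " "$file" > seq_X
grep -A2 "chrY " "$file" > seq_Y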

Then run it as follows:

./extract.sh myfile
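
Two caveats apply to the plain grep pattern: "chrY " also matches a header where chrY appears as the other genome's chromosome (column 5), and grep prints a "--" separator line between non-adjacent groups of matches. The following is a minimal sketch of a stricter variant, not part of the original answer. The function name extract_chr is hypothetical, and it assumes that columns are separated by whitespace and that no data line consists of exactly "--".

```shell
#!/bin/bash
# Hypothetical helper: extract_chr CHROMOSOME FILE
# Prints each header whose SECOND column is CHROMOSOME, plus the 2 lines after it.
extract_chr() {
  # ^[^[:space:]]+[[:space:]]+ skips the first column, so only column 2 is tested.
  # The second grep drops the "--" group separators that -A inserts.
  grep -A2 -E "^[^[:space:]]+[[:space:]]+$1[[:space:]]" "$2" | grep -vx -e '--'
}
```

For example, extract_chr chrY myfile > seq_Y would write only the records whose human chromosome is chrY.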

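Answer 2

  • $2 refers to the second column.
  • "chr19" means we only retrieve the records whose second column is "chr19".
  • {c=3}c-->0 prints the matching line and the 2 lines that follow it: c is set to 3 on a match, and c-->0 stays true while the counter, decremented after each test, is still positive.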
awk '$2=="chr19"{c=3}c-->0' file > chr19_file
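
The one-liner above handles a single chromosome per run. The same counter idea can be extended to split every chromosome in one pass over the file, which may matter for a large input. This is a sketch under assumptions, not the original answer's method: the function name split_by_chr and the .txt output names are hypothetical, and a header is assumed to be any line with more than two columns whose second column starts with "chr".

```shell
#!/bin/bash
# Hypothetical helper: split_by_chr INPUT [OUTDIR]
# Writes OUTDIR/<chromosome>.txt for each chromosome found in column 2.
split_by_chr() {
  local input=$1 outdir=${2:-.}
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    # Header line: pick the output file from column 2 and start a 3-line countdown.
    NF > 2 && $2 ~ /^chr/ { out = dir "/" $2 ".txt"; n = 3 }
    # Print the header and the 2 lines after it to that chromosome file.
    n > 0 { print > out; n-- }
  ' "$input"
}
```

Running split_by_chr myfile out would create files such as out/chrY.txt and out/chr2.txt. Only the two lines directly after each header are kept, matching the file structure described in the question.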
