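I would like to create separate files based on the first chromosome mentioned in the header line. Twenty-four chromosomes appear in the header lines, and the sequences for each header are given in the two lines that follow it. The file structure is as follows:
header
human sequence
other genome sequence
However, the sequences for all chromosomes are combined in a single file, and I want to split it into separate per-chromosome files, each holding the headers for that chromosome together with their sequence pairs. I wrote a Python script for this, but uploading the large merged file to the cluster takes a long time and often fails with connection errors, so I would like to use a Bash script instead.
The idea is to search the second column of the header line for "chrY" (or whatever the chromosome name is) and then write that header line and the 2 sequence lines that follow it to a separate file.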
2057524 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA
Details:
2057524 chrY (human chromosome) 68 170 chrX (other genome chromosome) 23685 23787 - 4125 -> header line
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA (human sequence)
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA (other genome sequence)
For testing purposes:
2057521 chr10 57211219 57211230 NW_007726181v1 1018288 1018299 + 575
CTGGGCACTATG
CTGAGCGCGGTG
2057522 chr2 57211231 57214400 NW_007726181v1 1018406 1021615 + 116172
GTTTtgagcttgt----acccagcgctgcttttgccttgctctgtgaccccaggcaagctgcctcacctctctgggccagtttccccat-cgtacagtggTGCTGCACACCCTGGCCCTGGCCC-CGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTTTCTCATCAGTGCCCGGTGCTGGGT-CAGGGATCGACTGAGGCTCT--GAGCTAACTAGGAAACACAGTGGCCTTG--GAGGGCTGGGGAGTGTCATGGGGGTG---GGGACAGGGAGCCACCGGTCGCATGTGACTGAACTCTT-----------------CACCCCAGTCTGTGGCTTTCCCGTTGCAGTGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGGGTAGGAGGCAGGGCCGCTGCCCGCCCTGGGTCGGCACCT---------------TGTAATTCTGTCCTGCCTTTTTCTTCCTGTATTTAAGTCTCCGGGGGCTGGGGGAATCAGGGTTTCCCACCAACCACCCTCACTCAGCCTTTTCCC-TCCAGGCATCTCTGGGAAAGGACCT------GGGGCTGGTGAGGGGCCCGGAGGAGCCTTTGCCCGCGTGTCAGACTCCATCCCTCCTGTGCCCCCACCGCAACAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCCTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCTGTTGCTCTGACATGGACACAGCCAGGACAAGCTGCTCAGACCTGCTTCCCTGG-GAGGGGGTGACGGAACCAGCACTG---------TGTG-GAGACCAGCTTCAAGGAGCGGAAGGCTGGCTTGAGGCCACACAGCTGGGGCG---GGGACTT-CTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACC--------CAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCAGAGAAAACGGCA-CACCAATCAATA-----------AAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGACCTCTAGCTGTGCAGGATGCAAACGTCTCGGGGTCAGTGACTGCCTCCTGCCCCTGTTGGTCCCTAGGCAGTGGGGGCAGAAGCTCCCAGCTGACCTG------TTTCTCTGGGAGAGAAGGGCAGTCAGCAGGGGCAGCTGTTGCAGATGGGAGGAATAG--------TCTCCCACA----AAAAAGGTTTCAGTGACAGACACGGGGTCTCTAAAAATAGTCATGCTGAGAGCCCAATGGCCCTTGGCACAATTGCTGGTGTTGGGGTAGAAGATGTCTTGGAGTTTGCTCAAGTGGTTGAGAGGGAGGGAGGTGCCATCAACTT---GGAGGAACTGGCACCAAGCCAGGGAGATAGAAATCCAGGCAAGGCTGTGGGGCAGGTTAGGGAGCAAGGCTGCAGGGGTGACTCAGGAAGAAGGTGGGGGAGGTGACAAGCCCCCAGGCAGGGGCCCTGTGGCC-------------ATGGGGATCTTTTTAAATTGAGACTAGGGGGTGAATAGTCCAGGGCAGCTAACTTTAGTTATTATAGAAAG-GGCAGTAGCAGATGGGTCTG-CTCCGTCTCGCTTCTAAGAAGGTGG---------------GCAGGACAAATGGCAGCCTCCTGCAGAGGCCCAGTGAGAAGCCTGGCCC-------TCGGCCAC-----ACAGGATGGAAGACAGATTGGATTCCACAGAGGGGAGCTGCCCTGGGAAGATCTCACGGATGGCCAGGACCCACCATTTCTTCGGGGTTCCCCT-GTTTTCTCCAACGGGCACTAATGCCTGTGCCTGGGTCCTGGCAACAC----------------------TCTGGACTCCACACTCT--TCTGGGTTTCACCTTTGTA-GCAGGATCCCTGCAGATCAGGCCCATGACAAACACCGTCTCCAGCGGGC
AGAGCAAAGGAAGGGCGCAGCGCCAGGCAGTGGTGCAGCTGCCTGTCAGGAAGAGGCCTACTTCT---GGTGAAACTGGGCAGAC---AAAAGGCAGTGAGAAATGTGATCTCGGGGTGGTGGAGGCTC-TAGGGAAAGGAAAAGGCAGGAGTGAACTTCCACACAGCAGCAATGGCAGAACCAAAGGTGGCTTTGACCTCCACGAGGGCTCAGATCCAGGCCAACAGCTTGTCCAGGACAGGGTGCCGGGTGTATCACTAATCCAGGAGCACTATGCTGGCAGAATCCCTTTGGTGCCTGATGGCCCTGCCTTCGTGGGAACAGAGGCTAAGGCTTTGAGTTACAGCTGCCTCCCCAACAGTGCATCCCCTTCTCCTTCCTCAGCCTCAGGTAGGAGACAGGGCAGGCAACCCCCCTTTCCTCTTCTCCCCTTCTCCAGCCCCTGTCTGTCCACCCAGCTGGAGGCAG--CCAGGCTTGCCTATGGACTGGTTGACAGCCTTCATGCACAGGTTCTCCACCAGAGCCTTTCTTGGGGGCCCCTGGCT--GGGCTCTGAGCTGGGAGTGAAGGGGATGACCCATGCGGACTGTTTGCTGC-------------TTGTAGCTTTCCCTGGGA-AAGACTCTGCCAGGCCTTGGAGCCAGACCAGGAGGCTTTATAGGCCACTGCAAGCAGCAGGGCTCCAGATGACATCACAGGGAATATCAAGAGGGTGTGGAGGGGCATCGAAGCCTCTCCAGGAG---ACAG----GAGAC---GCCGGCCCAGTAGAGCCCTAGGGGCGACGCCACTCCCACTCACTGTCTACTCTCCTCTCACCTCTGCAACACTGGGGACACTCACAAGATTGTGATCCAAGTCGGCCGTCGTCTTCTGCAGCTCTGGAGACCTGATGCTGGGGAAGGGCATGCCTGGCATCACCACACACCTGGGAGGAGACAGGAGCCTG-GGGCCGGTGG---------------------GCCCACACATCACCAGCTGCTCCGTTCTACCATTTCTTCAGCCCTCTTGGCTGTGC-CTGCGGCTCTGCCCCTCCCCTCTCTGCACCTACCACCCAGAGAGGGCTTGTTGAGCTCAGAGATCCCACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGATGGCCTGCCTGATCTTCCACCTGCTCTCCCAGGGCCAAAGCCAGACCTGCTGAGCCCCTCCC--TCCAGCCGGCTGGT-CTGAGCAGTCACAGCCCGGCTTTGGGCTCCGATGGCAGCAGATGGCAGGTAGGGGTCCAGCTGCTGG-AGCGAGGGCCGGCCACGTATCACAG-CCAAGGAGATGAGCACAAG--CACTACTTACTGGCCTAGGTTGTCAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAATC
gtcctgagtttgccaaggcccagctctgcttctgacttgtcctgtg-----agacaaagtgcctaacgtctttgggccagtttcctcatccccacagtggggctgcaca-cctgcccgtgtcttacaggatggccgtgatgt------tca-----CCATTTTCTTAT-AGTACCCACTGCCAGATACACGGACAGACCAAAGCTCCCAGAGCTCA-TGGTGAACAT-GTGGCTGTGGAGAGGGCTGGGGACTGTTGCAGGGGCAAGTGAGCCAAGAGGGCACTGG-CGACTGGGCCTGGAGCCTCCCGACTTGGCCCCCGCACCCTCCCACCTGCAGCTTTCCTCTTGCAGTGAGAGCCACAAGCCAAGGTGGAGACCTGATGTCAGATCTCTTCAACAAGCTGGTCATGAGGCGCAAAGGTAGGAGGCAGAGGGGCCGCC--TTCAGGGCAGGGGCCTCAGGGTGTCCTGCAGTGTACTTCTGTTCTGCCTCTTTCTTCCTGTTTTTAAATCTTCAGGGGCTGTGTGACCTGGGGCCTCCATACACCCTCCCTCA--CAGCCTTTTCCCTTCCAGGTATCTCCGGGAAAGGACCTGGAACAGGGGCCAGCGAGGGGCCAGGAGGAGCCTTCGCCCGAATGTCAGACTCCATCCCGCCTCTGCCTCCCCCACAGCAGCCAC---CGGGAGAGGACGAGGATGACTGGGAATCCTAGGGGTCT-CAGCACTCCTTCCTCCCCCAACCCAGACTTGGGCTGTGGCCCTGAGACAGACACAGCTGGGACA--------------GCCCCCTTGGTGAGACAGGGATGGTG-CAGGACTGCCCTACGTCTGTGCTGGGCCTTCTTCAGGGAGCGGGTAGGTTGCATGAAACCATAAGTGTGGGGTGGGAGGGGCTCGCTCTCCACCTGTGCCCCACCGTGTGCCTGCTCTACCCACCCCTTCAGCGTGTGCTCCTCTTCCCGAAAGAGACT--CGAAGAAAACAGCACCATGAATCAATAAAGGACGATGTAAGAACTGAGCATAAACCAACAGTGCACTTTTAATTAAGGAGTCAAGGCTGGGTGGCTTGCAAACATCTGAGAACCAGTGACTG--TCCTGCCCC-GTGGGTCTCCAGGCAAT-GGGGCAGAACATCTGAGTGGACCAGGGCCCCTTGCACTGGCTCGAAGGTGCAGTCAGCAGGGGCAGCTGCTGTGGATGGGAGGGAGGGAGGGAGATGTTCCCACGGGATAAAGATGTCTCAGTGACAGACATGGGGTCTCTAAAAATAGTTGTGCTGAGAGCCTAATGGCCCTTGGCATAATTGCTGATGTCAGGGTAGAAGGTGTCTTGGAGTTTGCTCAAGTGCCTGAGAGGGAAGGAGGTGCCATCAACTTGGAGGAGGAATGGGAGCCAAGCCAGAGAGA-AAAGCCCTGCGCGGAGCTGTGGAGCAGACCA--GAGCACAGCTG-----------------------------------AGGCTGGCAGG-AGGAGCCGTGTGGACAGCAGAACTAGAAATGGGGAACGTTTTGAGT------------GTGAAATGTCTAGAACAGCTCATTTTAGCTAGGATGAACAGAGGCAG-----GATGGGCCTGTTTCCATCGGACCTCTGAGAAGGTGGCTACTGAGAAAACATGCAGGACAGAAG-----CTGCAGCAGAACACCGGGCAGGAGCCTGGCGCGGCCAGTGTGGCCACACTAAACAGGGAGGAAGATGCAATGG------CAGGGAGCAGCTGCCCTGCAGTGGGCTCAAGGGCAGTCAGGACCCACTGTTTACTCAGGATCAACCTAGTTTTCTCCAACTGGCTTTTCTACCTGGGCCTGCATGCGGGCAGCCCACTGATGCTGGAAGGGGGCTGGTCTGGACCTCACACTCTACACCTGGTTTCACCTTCTTAGGCAGGATCCCTGTAGACCAGGCCCAAGACAAACACCATTCTAAGTGGGC
AGGGTAAAGGAAGAGC------CCGGGC--TGGTGCAGCCATCCATCAGGAACGGCCAAACTTCTCCCGATGAAACTGGGGAGATGGGAAAAGGCAGTGAGAGACTAGATCTCAGGGTGA-GCAGGCTCGGGGGGGAAGGAAAAGGCAGGACTGACCTTACGCATAGCAGCAACAGCATGGCCAAAGGTGGCCTTGACCTCCACACGGTCTCGGATCCAGCCTGGCAGCTTTGCCAGGATGGGTGGGCGGGCATATCGCTGGTCTAGGAGCACTATGCTGGCAAAATCCCTCTGGTGCCTGATGGCTCTGCCTGGATGGGAACAGAATTTGGGGCTCCTAGGTAAA-------------------ATCCTCTCCTGTGACTTCATTCTC-------------------CAACCACCCAT--CTGTACTCC----------CAACTATCCATCCTGACAGCCAGGAGCAGTCCCAGGCTTACCTATAGATTGGTTGACAGCCTTCATACACAGATTCTCCACCAAGGCCTTCCCTGGTGGGGGCTGGCCTGGGGTTCTGGGCTGGGAAGGGTAGAAAGGACCTATCAGAACTGTTCCTTACCTCCTGTCTAGTGTTCTAGCTCTCCCTGGGAGAAGAGCCTGCCAGGCTTTGGAGCAAGACCAGGCAGCTTCACAAGCCAGTGCCAGCAGCTGG------CACGATGTCATGGAGAAGGTCAAGAGGGGGACAGGAAACACC--AGCATGGCAAGGAAGTCACAGCTACAAGACCCTGCTATCTCAG------CCTAGGGAATACACCACACTTCCCCCCGGCC--CTCTCCTCAT-CCTCTGGAATCCTGGAGGTACTCACAAGGGTCTGATCCAAGTAGGTCATCTTCTCTTGTAGTTCTGGAGAGTTGATGTTGGGGTAGGGCATGCCCACCATCACTACACACCTAGGTGGAGATGCACGCCGATGGGCATGTGGCCTCACACTCACTGAGTCCTCACCCACATGCCACCGACTGCT--GCTCTACCTCTGCTGCCG--CTCTTGGCTATGCTCGGCAGCTCTACCCTCCGCATC-CCGTACCTACCACCTGGAAAGGATTTTTTCAGCTAAGAGACCCAGTCTAAGCCAATGAACATAGTCCTGATAAGGTTATGGTTTGCCCCATTTTCCATCTGCTCT-CAAAGGCCCAATCCAGAGTTGCTGAAACACTTCCCGCCTGGCTGCCTGATCCTGAGCAGC--CAGCCTGGGTGCAGACTCAGATGGCATCAGATTGCAGGT-GGGGCCCAGCTGCTGGAAGTGAGGAGTAGCCAGGTGTCATAGCCCCAGGAGAGAAGGAGAGGACCACTACTTACCGGCCTAGGTTGTCAGAGAAGTTAATCCCTTCACTCATCTTTCCTCCAACC
2057523 chrY 57214466 57215088 NW_007726181v1 1023265 1023919 + 29358
GGCCCATCCCACTCTAGGCATGGCTCCTCTCCACAGGAAAACTCCACTCCAGTGCTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCT---GGAAGTCAGACACCTGCAGATCAAGACCACAGCATCAAGACCCTGTGACCTCTCAAAGGCCTGGTGGAAAGGA--------------CACGG-----------GAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAA--GGCTGACGGCAAGTTAACAAAAAGAAA--AATGGTGAATGATACCCGGTGCTGGCAATCTCGTTTAAACTACATGCAGGAACAGCAAAGGAAATCCGGCAAATTT-GCGcagtcattctcaacaccggccatgcagcaaaatcatcagtggaaatttaaaaaaatacacgtggccaggccccagcccaaatcact-aataagaatctccaggg-CTtcacctgttagactggcaaaaaatccaaaag--taaacactttgtggagaaacaggcactcctagacattgctggtgggatacagaacagtacaattctga------------tggtaatcagttaacaaattaaacatatttattttatacttttaaacccaggaatcccatatttaggagtctactgagaccaaacagc
GGCTCGCTCCGCCCGGGTCACA-CTCCTCACCGCAGGAGAACTCCACCAC-TCGCTCAGCCTCAGCCCCAGCGCACGCCAGCAGCTGCTCCCGGAAGTCAGACACCTGCAGA------CCACAATGGCAGGGCCCTGTGACCTCCCAGAGGCACAGGGGAGAAGAACCTCAGGCCTCGGCATGGAGGGCAAGACAGAAGTCTGGGCTGGAAGGCAGCAAGTACGTACAAACAGAAAAAAGAGCTAAAAAAAAAAAGGCTAACAACAAATTAACAATAATAAATAAATTGTTAATGATATCCAGTGTTGGCAGTTTCATTTAAGCTACTGGTAGAAACAGCAAAGGAAATCTGGCAAACTTGGCAcagtgattctcaaccctggctatgcatcaaaaccagcagtgggaatttaaaaaaatacACATGGCCCAGCCACAGTCCAAACTACTGAATAACAATCTCCAGGGttttcacctaccaaattggc---aaatccgaaagtttaaccactctgtggagaaaaaggcatttttaaacattgctggtgcaatacagaatagtacaactcttacataggggaatttgacaat-acttaacaaattaaatgga-----tttttactttttaactcaggaatctcatatctgggactccacccagaatacacagc
2057524 chrX 68 170 NW_007727164v1 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA
Answer 1
I suggest using grep with the -A switch, which tells it to also print the lines that follow each match.
Like this:
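#!/bin/bash
# Usage: ./extract.sh myfile
file=$1
# Autosomes chr1..chr22 (22 autosomes plus X and Y give the 24 chromosomes in the question).
# The trailing space in the pattern keeps "chr1 " from matching "chr10".
for i in $(seq 1 22); do
    grep -A2 "chr$i " "$file" > "seq_$i"
done
# Sex chromosomes, each written to its own file.
grep -A2 "chrX " "$file" > seq_X
grep -A2 "chrY " "$file" > seq_Y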
Then run it as follows:
./extract.sh myfile
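One thing to watch with -A: when the matching records are not adjacent in the file, grep prints a "--" separator line between the groups, and that line ends up in the output file. The sketch below shows the effect on a small made-up sample (the file name sample.txt and its contents are invented for illustration) and one way to suppress it. Note that --no-group-separator is a GNU grep option, so this assumes GNU grep is what the cluster provides.

```shell
#!/bin/bash
# Made-up sample: two chrY records separated by a chr2 record.
cat > sample.txt <<'EOF'
1 chrY 1 4 scafA 1 4 + 10
ACGT
ACGA
2 chr2 1 4 scafA 5 8 + 10
GGGG
GGGA
3 chrY 5 8 scafB 1 4 - 10
TTTT
TTTA
EOF

# Plain -A2: a "--" line is inserted between the two non-adjacent chrY groups.
grep -A2 "chrY " sample.txt > with_sep.txt

# GNU grep only: drop the separator so the output contains records only.
grep --no-group-separator -A2 "chrY " sample.txt > no_sep.txt
```

If GNU grep is not available, piping the output through grep -v '^--$' should have the same effect, since a line consisting only of "--" is not a valid header or sequence line in this format.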
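Answer 2
$2 refers to the second column.
$2=="chr19" means we only retrieve the records whose second column is "chr19".
{c=3}c-->0 sets a counter to 3 on each match and prints lines while the counter is still positive, so the matching header line and the 2 lines after it are output.
The command: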
awk '$2=="chr19"{c=3}c-->0' file > chr19_file
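The same column test can be extended to split every chromosome in a single pass instead of running one command per chromosome. This is only a sketch under two assumptions: header lines are the only lines with more than one whitespace-separated field, and sequence lines contain no spaces (alignment gaps are written as "-", as in the sample above). The sample.txt contents and the seq_<name>.txt output names are made up for illustration.

```shell
#!/bin/bash
# Made-up sample: two chrY records separated by a chr2 record.
cat > sample.txt <<'EOF'
1 chrY 1 4 scafA 1 4 + 10
ACGT
ACGA
2 chr2 1 4 scafA 5 8 + 10
GGGG
GGGA
3 chrY 5 8 scafB 1 4 - 10
TTTT
TTTA
EOF

# NF > 1 is true only on header lines: switch the output file to one named
# after column 2. Every line (header or sequence) is then printed to the
# current output file. awk truncates each file the first time it is opened
# and keeps appending to it for the rest of the run.
awk 'NF > 1 { out = "seq_" $2 ".txt" } out != "" { print > out }' sample.txt
```

Because the test is on $2 only, a chromosome name that appears in another column of a header (such as the "chrX" in column 5 of the first example in the question) does not send that record to the wrong file. With 24 chromosomes the number of simultaneously open output files stays small, so no explicit close() calls are used here.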