Removing reads from a fastq file

I want to remove four lines from a fastq file. Normally the file looks like this (four lines per read):

@M04241:303:000000000-BR896:1:1102:21438:12389 1:N:0:TATGGCAC
TGTCAGCCGCCGCGGTAATACGGAGGGTCCGAGCGTTATCCGGAATTATTGGGTTTAAAGGGTCCGCAGGCGGGCTTATAAGTCAGGGGTGGAATGGTGCGGCTCAACCGTAGCACTGCCCTTGATACTGTTAGTCTTGAGTTATGGTGGAGTGGCCGGAATATGTAGTGTAGCGGTGAAATGCATAGATATTACATAGAACACCGATCGCGAAGGCAGGTCACTAACCATTTGACTGACGCTGATGGACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGGAAACGATGGATACTAGCTGTCGGGCACTTGTGCTCGGCGGCCAAGCGAAAGTGATAAGTATCCCACCTGGGGAGTACGTGCGCAAGAATGAAACTCAAATGAATTGACGG
+
EGGGGGGGGGGGGGGGGGGGGGGGDE@FFGEEEGGGGDGFEFGGGGGGGGGGGGGGGGGGGGGGGDGEFFGGGCGGFDF<DGGFGGGGGGGG7FFG?FDF:FGGGFCGGGGFEGGGF:>GGGG>F>DE@GG6@GGG@G9<EGGGG9FGGGGGG7FGGDDEFGGGGGGGGGGGGGGGGCEFGGGGFG?EFFCFGGGGGGFGG?GGGGGGGG=EGEGGGGGGGGGGGFGCGGFGGGGCFFF6CD7DDFFFFFED9:BFCBEE@DEF:@EGCFCF@FFFD?=A:CFEF0<C<A>FB>@6+C,@GFFGFDGGF<AFEFB+FEECGFF9FDFAC6@+:@FC:GFC,CFC,EFGE,9FFCGFF<@;6:,FD,D:FGGFFGF7@8+7,,CF<<6CF<CC-CA@<GEGFE@6@A,CB
@M04241:303:000000000-BR896:1:1103:11464:7575 1:N:0:TATGGCAC
GTCAATTTCTTTGCGTTTCAATCTTGCGATCGTACTCCCCAGGTGGGATACTTATCACTTTCGCTTAGTCACTGAGATAAATCCCAACAACTAGTGTCCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGTCCATCAGCGTCAGTATATGGTTAGTGACCTGCCTTCGCGATCGGTGTTCTATGTAATATCTATGCATTTCACCGCTACACTACATATTCCGGCCACTCCACCATAACTCAAGACTAACAGTATCAAAGGCAGTGCTACGGTTGAGCCGCACCATTTCACCCCTGACTTATCAGCCCGCCTGCGGACCCTTTAAACCCAATAATTCCGGATAACGCTCGGACCCTCCGTATTACCGCGGCTGCTGGC
+
CCCCCGGGGGGGG-FCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGFFGGDFFGFGFGGGGGGGGGGGGGGGGGGGGGGGGGEGGEGGGGDGGG4FFGGGGGGGGGGGGGGGGGGGGGEGGGGGGFGGGFFGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFGFFFGFGGGGGGGGGGGGGGGGGGGFGGFFGGGGGGGGGGGGGGGGGGGCDGGGGGGGGFCFGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGFGGGGGCGEFFGGEGGGGGGGGGGGGGGGGGDGGGGFFCGGGGGGGGGGGGFGGGDGGGGGGGGGGGGFGGGGGGGGGGGGGGGGG
@M04241:303:000000000-BR896:1:1103:23291:21403 1:N:0:TATGGCAC
CTGCGGCACCGCAGGGCAAGCCCCCCGACGCCTAGCCCACATCGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTGCCGGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCACCCAATATCTACGAATTTCACCTCTACACTGGGTATTCCACCCTCCTCTTCCGGACTCGAGCACCGCAGTCTCGGCTGCACCTCCGGGGTTGAGCCCCGGGCTTTCACAGCCGACTTGCGACGCCGCCTACGCGCCCTTTACGCCCAGTGATTCCGAACAACGCTAGCACCCTCCGTCTTACCGCGGCGGCTGAC
+
CCCCCGGGGGG>FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@@FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

However, I found that for one read, two of the four lines are empty, as shown below:

@M04241:303:000000000-BR896:1:1103:11464:7575 1:N:0:TATGGCAC

+

@M04241:303:000000000-BR896:1:1103:23291:21403 1:N:0:TATGGCAC
CTGCGGCACCGCAGGGCAAGCCCCCCGACGCCTAGCCCACATCGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTGCCGGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCACCCAATATCTACGAATTTCACCTCTACACTGGGTATTCCACCCTCCTCTTCCGGACTCGAGCACCGCAGTCTCGGCTGCACCTCCGGGGTTGAGCCCCGGGCTTTCACAGCCGACTTGCGACGCCGCCTACGCGCCCTTTACGCCCAGTGATTCCGAACAACGCTAGCACCCTCCGTCTTACCGCGGCGGCTGAC
+
CCCCCGGGGGG>FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@@FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04241:303:000000000-BR896:1:1103:26180:21941 1:N:0:TATGGCAC
CCGCCAATTTCTTTGAGTTTCAGCCTTGCGACCATACTCCCCAGGCGGGGTACTTAACACTTTTGATTCGGCAGTGCACCCATGTTAGTCCACTACCTAGTACCCATCGTTTAGGGCTAGGACTACCGGGGTATCTAATCCCGTTCGCTACCCTAGCTTTCGCGCCTCAGCGTCAGAAGAGGTCCAGCACGTCGCTTTCGCCACCGGCGTTCCTTCCGATCTCTACGCATTTCACCGCTCCACCGGAAGTTCCACATGCCCCTACCTCCCTCGAGATTGGCAGTTTCGAAGGCAGTTCTACAGTTGAGCTGCAGGATTTCACCTCCGACTGACCTATCCGCCTACGCGCCCTTTAAGCCCAGTGATTCCGAACAACGTTCGC
+
CCCCCGEGGGGGGGGGGEGGGGGGGGGGDFGGGGGGGGGGGGGEGGGGGGEFGGGFFFFGGGGGG,CEFGGGGGGGGGG?GGGGGG9FFGGGGGGGCGGGGGGGGGCFGGGG@GGGGGFGGGGGGGGGCGGFGGGGGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGDEGGGGGGGDGGGGFGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGDGEFGGEEGGGGFGGGGGGGGGGGGGGGGGGGGGEF?GGGEGGEEFEFFDFFGFGGFGGGGGGFFFGFGGGGGGGGGFGGGGFCGGGGGGGGGFFGGGGGGGGGGGGGGGFF@7GGGGGGGGGGGGGGGFDFCGGGGFEFGGFGGGGGGGGFGFEGGGG
@M04241:303:000000000-BR896:1:1102:21438:12389 1:N:0:TATGGCAC
TGTCAGCCGCCGCGGTAATACGGAGGGTCCGAGCGTTATCCGGAATTATTGGGTTTAAAGGGTCCGCAGGCGGGCTTATAAGTCAGGGGTGGAATGGTGCGGCTCAACCGTAGCACTGCCCTTGATACTGTTAGTCTTGAGTTATGGTGGAGTGGCCGGAATATGTAGTGTAGCGGTGAAATGCATAGATATTACATAGAACACCGATCGCGAAGGCAGGTCACTAACCATTTGACTGACGCTGATGGACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGGAAACGATGGATACTAGCTGTCGGGCACTTGTGCTCGGCGGCCAAGCGAAAGTGATAAGTATCCCACCTGGGGAGTACGTGCGCAAGAATGAAACTCAAATGAATTGACGG
+
EGGGGGGGGGGGGGGGGGGGGGGGDE@FFGEEEGGGGDGFEFGGGGGGGGGGGGGGGGGGGGGGGDGEFFGGGCGGFDF<DGGFGGGGGGGG7FFG?FDF:FGGGFCGGGGFEGGGF:>GGGG>F>DE@GG6@GGG@G9<EGGGG9FGGGGGG7FGGDDEFGGGGGGGGGGGGGGGGCEFGGGGFG?EFFCFGGGGGGFGG?GGGGGGGG=EGEGGGGGGGGGGGFGCGGFGGGGCFFF6CD7DDFFFFFED9:BFCBEE@DEF:@EGCFCF@FFFD?=A:CFEF0<C<A>FB>@6+C,@GFFGFDGGF<AFEFB+FEECGFF9FDFAC6@+:@FC:GFC,CFC,EFGE,9FFCGFF<@;6:,FD,D:FGGFFGF7@8+7,,CF<<6CF<CC-CA@<GEGFE@6@A,CB

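How can I detect these empty lines and remove them from the fastq file? I know the line numbers, but the file is too large to open in the usual way, so I need a command that detects whether these two lines are empty and then deletes all four lines belonging to that read.

Thanks!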

Answer 1

sed 'N;N;N;/\n\n/d' file.fastq >new-file.fastq

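This reads the four lines of a FastQ record into the pattern space and then checks whether they contain two consecutive newlines, which is what an empty sequence line produces. If they do, the whole record is discarded; otherwise it is printed. This is repeated for every record in the file. All printed records go to the new file (here new-file.fastq).

The sed script with annotations: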

         # (implicit: read a line)
N;       # read a second line, append it to the pattern space with embedded \n in-between
N;       # read a third line
N;       # read a fourth line
/\n\n/d  # if there are two consecutive newlines, delete and continue from top
         # (implicit: print)
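
As a possible alternative to the sed one-liner, the same filter can be sketched in awk. This is not part of the original answer: the function name drop_empty_reads is made up for illustration, and the sketch assumes strict four-line records with no blank lines between them. It also drops a record whose quality line alone is empty, which the /\n\n/ test would not catch.

```shell
# Hypothetical helper (not from the original answer).
# Drops every 4-line FASTQ record whose sequence (line 2) or
# quality (line 4) is empty. Assumes strict 4-line records.
drop_empty_reads() {
    awk '
        { rec[NR % 4] = $0 }      # lines 1,2,3 go to rec[1..3]; line 4 goes to rec[0]
        NR % 4 == 0 {             # a complete record has been buffered
            if (rec[2] != "" && rec[0] != "")
                print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
        }
    ' "$@"
}
```

It would be used the same way as the sed command, for example drop_empty_reads file.fastq >new-file.fastq.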

Comments from other users:

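Fastq records often come in pairs, and when software cannot find a read's mate it tends to throw a tantrum without explicitly saying that a mate is missing. Several tools with a minimum-length option, such as Trimmomatic, keep pairs together and separate out the orphaned reads.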

In other words, if the reads in the file are paired and one read of a pair is empty, simply removing the empty record will break the pairing.

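Removing the mates of the empty reads is considerably more involved unless you use an existing bioinformatics tool. With tools from the standard Unix toolbox, you could save the empty reads to a separate file and then use their FastQ headers to scan for and remove the corresponding mates.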

The data shown in the question appears to consist of unpaired reads only.
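
For paired-end data, the Unix-toolbox approach described in the comments could look roughly like the sketch below. Everything here is an assumption for illustration: the function names are invented, the records are assumed to be strict four-line records, and both mates are assumed to share the same read ID up to the first space in the header (as in the headers shown in the question). Older header styles ending in /1 and /2 would need extra handling.

```shell
# Hypothetical helpers (not from the original thread).

# Print the read ID (header up to the first space, without the
# leading "@") of every record whose sequence line is empty.
empty_read_ids() {
    awk 'FNR % 4 == 1            { id = substr($1, 2) }
         FNR % 4 == 2 && $0 == "" { print id }' "$@"
}

# Remove from FASTQ file $2 every record whose read ID is listed
# (one ID per line) in file $1.
drop_reads_by_id() {
    awk -v idfile="$1" '
        BEGIN { while ((getline line < idfile) > 0) bad[line] }
        FNR % 4 == 1 { skip = (substr($1, 2) in bad) }   # decide once per record
        !skip                                            # print lines of kept records
    ' "$2"
}
```

With placeholder file names R1.fastq and R2.fastq, one would first collect the IDs with empty_read_ids R1.fastq R2.fastq | sort -u >empty.ids, and then run drop_reads_by_id empty.ids R1.fastq >R1.clean.fastq and the same for R2.fastq, so that both files lose the same records and stay in sync.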
