Grep not working - searching a file using variables

Question 1

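The script above greps the MKD_nsi_lib1_R1_001.fq file 48 times. Unless that file is small, the script is going to be very slow.

It also runs cat and sed on Barcodes.txt 48 times. That isn't fast either, but it is nowhere near as "expensive" (in terms of time and disk I/O) as reading the .fq file 48 times.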

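Instead of running grep many times over the same input file, write an awk or perl script that does what you need in a single pass over each file (barcodes.txt and MKD_nsi_lib1_R1_001.fq). The larger the file, the more this pays off.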

Something like this:

#!/usr/bin/perl

use strict;
# %patterns is a hash where the keys are fixed-text
# strings, and the values are file-handles to 
# files opened for append.
my %patterns;

# First open the barcodes.txt file and read it into
# the %patterns hash
my $barcodes;
open($barcodes,'<','barcodes.txt') || 
  die "Couldn't open 'barcodes.txt' for read: $!\n";

while(<$barcodes>) {
  chomp; # strip the newline at the end of each line
  my $outfile = "GrepBarcode_$_.txt";
  open($patterns{$_}, ">>", $outfile) ||
    die "Couldn't open '$outfile' for append: $!\n";
};
close($barcodes);

# Now process the .fq file(s) listed on the command line.
# also works with stdin.
while(<>) {
  # this assumes that the keyword is at the start
  # of the line and is followed by whitespace. This
  # is only a guess on my part, since you didn't describe
  # or provide a sample of your file.  If there's a different
  # delimiter in the input file, adjust the regex in the split
  # function.
  my ($p,undef) = split /\s+/, $_, 2;

  if (defined($patterns{$p})) {
    print { $patterns{$p} } $_;
  };
};

To run it, save it to a file (e.g. split-fq.pl), make it executable with chmod +x split-fq.pl, and then run it like this:

./split-fq.pl MKD_nsi_lib1_R1_001.fq

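This is written to use fixed strings when comparing against the lines of MKD_nsi_lib1_R1_001.fq. It extracts the first "word" from each input line and checks whether it is a key of the %patterns hash. If it is, the current line is written to the associated file handle.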
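To make the lookup step concrete, here is a minimal sketch that routes lines by their first word using a single hash lookup. The barcodes and input lines are made up for illustration, and in-memory arrays stand in for the per-barcode output file handles, so treat it as a demonstration of the idea rather than a replacement for the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical barcodes and input lines, for illustration only.
my @barcodes = qw(AAAGA CCTTG);
my @lines = (
  "AAAGA first record\n",
  "GGGGG no matching barcode\n",
  "CCTTG second record\n",
  "AAAGA third record\n",
);

# One bucket (array ref) per barcode, standing in for the
# per-barcode output file handles.
my %buckets = map { $_ => [] } @barcodes;

for my $line (@lines) {
  # First whitespace-delimited "word" on the line.
  my ($word) = split /\s+/, $line, 2;

  # A single hash lookup decides where the line goes, no matter
  # how many barcodes there are.
  push @{ $buckets{$word} }, $line if exists $buckets{$word};
}

for my $bc (sort keys %buckets) {
  printf "%s: %d line(s)\n", $bc, scalar @{ $buckets{$bc} };
}
```

The cost per input line is one split and one hash lookup, regardless of whether there are 2 barcodes or 48.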

You could use regular expressions instead, but it would be slower:

#!/usr/bin/perl

use strict;

# %patterns is a hash where the keys are pre-compiled
# regular expressions anchored to the start of line ^,
# and the values are handles to files opened for append.
my %patterns;

my $barcodes;
open($barcodes,'<','barcodes.txt') ||
    die "Couldn't open 'barcodes.txt' for read: $!\n";

while(<$barcodes>) {
  chomp;
  my $outfile = "GrepBarcode_$_.txt";
  open($patterns{qr/^$_/}, ">>", $outfile) ||
    die "Couldn't open '$outfile' for append: $!\n";
};

close($barcodes);

while(<>) {
  MATCH: foreach my $re (keys %patterns) {
    if (m/$re/) {
      print { $patterns{$re} } $_;
      last MATCH; # no need to test any more patterns against current line
    };
  };
};

This is slower than the fixed-text version above, but still much faster than running grep 48 times in a shell for loop, because it only has to read the .fq file once rather than 48 times.
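One possible variation on the regular-expression approach, sketched here with made-up barcodes and lines, is to join all the barcodes into a single anchored alternation so that each input line is matched once instead of being tested against every pattern in a loop. Whether this is actually faster for your 48 patterns is something you would have to measure on your own data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical barcodes and input lines, for illustration only.
my @barcodes = qw(AAAGA CCTTG);
my @lines = ("AAAGATTT\n", "GGGGGAAAGA\n", "CCTTGCCC\n");

# Build one anchored alternation, e.g. /^(AAAGA|CCTTG)/.
# quotemeta() escapes any regex metacharacters in a barcode.
my $alt = join '|', map { quotemeta($_) } @barcodes;
my $re  = qr/^($alt)/;

my %count;
for my $line (@lines) {
  # One match per line; $1 is the barcode that matched and could
  # be used as the key into a hash of file handles.
  $count{$1}++ if $line =~ $re;
}

print "$_: $count{$_}\n" for sort keys %count;
```

The second line contains AAAGA but does not start with it, so the ^ anchor keeps it from being counted. If one barcode could be a prefix of another, the order of the alternatives would matter.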

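Note: these are only examples of how you could do something similar. I don't know whether they will handle your data correctly, because I don't know what is in your files: you didn't provide a sample of either barcodes.txt or the .fq file. You will almost certainly have to modify the scripts to suit your actual data.

Also, better tools for splitting fastq files may already exist. In fact, there is a huge library of bioinformatics scripts and tools written in Perl: https://bioperl.org/

If you prefer Python, see https://biopython.org/

And, of course, there is a Stack Exchange site for bioinformatics questions: https://bioinformatics.stackexchange.com/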

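The following version works with the sample data you provided.

It works much like the first fixed-string version (and should run at about the same speed), but it splits each input line of the .fq file into two fields (the variables $num and $data), using a colon (:) as the field separator.

It then uses Perl's substr() function to extract the first 5 characters of $data into another variable called $start.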

If the %patterns hash has a key equal to the value of $start, the current line ($_) is written to the associated output file (via the file handle in $patterns{$start}).
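Before the full script, here is a minimal sketch of just the split and substr steps applied to a single line. The line is a truncated copy of one from the sample data, in the same number:sequence shape.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Truncated copy of one sample line (number:sequence).
my $line = "6:AAAGAGAAATGTAATTTATACATAC\n";

# A limit of 2 means the line is split only at the first colon.
my ($num, $data) = split /:/, $line, 2;

# First 5 characters of the sequence: the candidate barcode.
my $start = substr($data, 0, 5);

print "num=$num start=$start\n";
```

This prints num=6 start=AAAGA, and AAAGA is the key that then gets looked up in %patterns.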

#!/usr/bin/perl

use strict;
my %patterns;
my $barcodes;

open($barcodes,'<','barcodes.txt') || 
  die "couldn't open 'barcodes.txt' for read: $!\n";
while(<$barcodes>) {
  chomp;
  my $outfile = "GrepBarcode_$_.txt";
  open($patterns{$_}, ">>", $outfile) ||
    die "couldn't open '$outfile' for append: $!\n";
};
close($barcodes);

while(<>) {
  my ($num,$data) = split /:/, $_, 2;
  my $start = substr($data,0,5);

  if (defined($patterns{$start})) {
    print { $patterns{$start} } $_;
  };
};

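When I ran this to test it, it produced empty GrepBarcode_?????.txt output files. That is because examplefile.txt has no lines matching any of the 5-character codes in Barcodes.txt. If I add AAAGA to Barcodes.txt, it produces a GrepBarcode_AAAGA.txt file with the following contents: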

$ cat GrepBarcode_AAAGA.txt 
6:AAAGAGAAATGTAATTTATACATACAGTACATATATATATGGCAGCTGTCTCCCCAAATCCTGCTCTACTGCGTCATTGTTGTGGGAATTATTCCTGGGAGGGATGCGTGAAAAATGCAAGGATATGTGCCAAGAGTACTGCAGCACTA
10:AAAGACACTGCAGATAAACCCTGTGTAATAAATACATAAAATATGTTCCAACCATTTTTATAAATTTTCTGAGTAATCTGTGTTGGATTTTCAGAGTAAGCAAATGAGAAATTAGAGTATTTGATTCCCTGTTGCTTATCCAGGACTTT

Answer