Merge specific lines from 3 files based on a specific pattern

Answer 1

You can do this with awk:

awk '/^ko[^:]/{fn=$1;next};/./{id=fn$1;if (!(seen[id]++)){print > fn}}' file[123]

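On each header line, it stores the ko***** identifier as fn. On sub-header lines (the lines under a header), it stores fn$1 ^1 as id, an index into the array seen, and writes the line to fn the first time that id is seen.

^1: You could also use fn$0.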
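
To make the steps above concrete, here is a minimal sketch that runs the same awk program, spread over several lines with comments, against three tiny inputs. The sample data and the scratch directory name ko_demo are assumptions made up for illustration; the real file1..file3 are not shown in this thread, and only the general shape of the lines (a "ko" + digits header, then "ko:K…" data lines) is taken from the output shown elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical sample inputs in a scratch directory named ko_demo.
mkdir -p ko_demo
printf '%s\n' 'ko00980 pathway A' 'ko:K00001 first' 'ko:K00079 second' > ko_demo/file1
printf '%s\n' 'ko00980 pathway A' 'ko:K00001 first' 'ko00982 pathway B' 'ko:K00001 first' > ko_demo/file2
printf '%s\n' 'ko00982 pathway B' 'ko:K00699 third' > ko_demo/file3

# Run in a subshell so the caller's working directory is left unchanged.
(
  cd ko_demo || exit 1
  awk '
    /^ko[^:]/ { fn = $1; next }      # header line: remember the output file name
    /./ {                            # any other non-empty line
      id = fn $1                     # key = current header + first field
      if (!(seen[id]++))             # true only the first time this key appears
        print > fn                   # write the whole line to the file named fn
    }
  ' file1 file2 file3
)

cat ko_demo/ko00980
cat ko_demo/ko00982
```

With this sample data, ko:K00001 ends up in both ko00980 and ko00982, because the key combines the header with the first field, so duplicates are only dropped within the same header.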

Answer 2

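There may be some magical super-mashup command, but sometimes "linear" is the easiest to understand and maintain.

So just keep track of the file name based on the header line and append the data to that file. Then you can pass each result through sort -u to get the unique lines.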

#!/bin/bash

# Clean out old results from previous runs
/bin/rm -f ko*

for file in "$@"
do
  filename=UNKNOWN
  echo "Processing $file"
  # IFS= and -r keep each input line exactly as written
  while IFS= read -r line
  do
    case $line in
      ko:*) printf "%s\n" "$line" >> "$filename" ;;  # data line: append to the current file
       ko*) filename=${line%% *} ; echo "Switching to $filename" ;;  # header line: first word names the file
        "") # Do nothing
            ;;
         *) echo "Ignoring unknown line: $line" ;;
    esac
  done < "$file"
done

for file in ko*
do
  echo "Making unique: $file"
  sort -u -o "$file" "$file"
done
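
The final loop depends on sort -u -o being able to write back to the file it is reading. A minimal sketch of that step in isolation, assuming a throwaway file named demo_sort.txt (a hypothetical name used only here):

```shell
#!/bin/sh
# demo_sort.txt is a hypothetical scratch file used only for this illustration.
printf '%s\n' 'ko:K00799 GST' 'ko:K00001 E1.1.1.1' 'ko:K00799 GST' > demo_sort.txt

# -u drops repeated lines; -o names the output file, which may be the input itself.
sort -u -o demo_sort.txt demo_sort.txt

cat demo_sort.txt
```

The -o form matters: writing `sort -u file > file` instead would let the shell truncate the file before sort has read it.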

The script can be run against the three source files:

$ ./pattern_split file1 file2 file3
Processing file1
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file2
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file3
Switching to ko00980
Switching to ko00982
Switching to ko00983
Making unique: ko00980
Making unique: ko00982
Making unique: ko00983

You can see that three unique files were created. Look at the first one:

$ cat ko00980
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]

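Note that this solution is not hardened against malicious data in the data files (for example, what happens if a file contains a header such as ko123/456? It could break). But it is an outline of how to solve the problem.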
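
One possible way to harden the script, sketched under the assumption that valid headers are always "ko" followed only by digits (that format is inferred from the sample output, not stated in the thread). The helper name is_safe_name is hypothetical; the idea is to check a candidate file name before ever using it as a redirect target.

```shell
#!/bin/sh
# is_safe_name: succeed only if $1 is "ko" followed by one or more digits.
# The accepted format is an assumption based on names such as ko00980.
is_safe_name() {
  case $1 in
    ko*) rest=${1#ko} ;;        # strip the "ko" prefix
    *)   return 1 ;;            # does not start with "ko"
  esac
  case $rest in
    ''|*[!0-9]*) return 1 ;;    # empty, or contains a non-digit such as "/"
  esac
  return 0
}

for candidate in 'ko00980' 'ko123/456' '../ko1' 'ko'; do
  if is_safe_name "$candidate"; then
    echo "accept: $candidate"
  else
    echo "reject: $candidate"
  fi
done
```

In the script above, the header branch of the case statement could call such a check and skip (or report) any header that fails it, instead of switching to that file name.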

Answer 3

So you want to move lines from the files into separate files based on the header?

I think something like this would do the trick:

#!/usr/bin/env perl
use strict;
use warnings 'all'; 

#hash of output filehandles. 
my %output_files; 

#detect dupes
my %seen; 

my $ko_num = 'NULL'; 

#<> is the 'magic' filehandle. You can either use it to iterate STDIN
#or take a list of file names on the command line (just like sed/grep etc.)
while ( my $line = <> ) { 
   #see if the line starts with 'ko':
   if ( $line =~ m/(^ko\d+)/) {  
       $ko_num = $1;
       #open a new file - for overwriting (so we only do this once)
       open ( $output_files{$ko_num}, '>', $ko_num ) or die $! unless $output_files{$ko_num}; 
       #skip printing - could write a header here instead. 
       next;
   }
   #look for a 'K' number. 
   if ( my ($K_id) = $line =~ m/ko:(K\d+)/ ) {
       #skip it if we've already seen this combination of 'ko' number 
       #and k number.    
       next if $seen{$ko_num}{$K_id}++; 
       #print the output to this particular output file. 
       print {$output_files{$ko_num}} $line; 
   }
}
#close the filehandles. 
close ( $_ ) for values %output_files;

So you can run this as "myscript.pl file1.txt file2.txt file3.txt" and it should do the right thing in a scalable way. It does not matter whether the input is separate files or a single stream.
