Copying a file multiple times, writing the duplicated files, sorting them, and computing the position of a specific line after sorting

Question 1

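Assuming you have the GNU coreutils version of split, and a shell such as bash, ksh or zsh available (for the process substitution feature used here), you can modify the previously accepted answer to handle the header line and the sorting: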

tail -n +2 myUniqueFile | SHELL=$(command -v bash) split -l1 --filter='{ 
  head -n 1 myDuplicationFile &&
    sort -g -r -k4,4 <(tail -n +2 myDuplicationFile) -
  } > "$FILE"'

You can then use a simple awk one-liner to find the position of each myUniqueFile entry in the output files:

awk 'FNR==NR && NR>1 {a[$0]++; next} ($0 in a) {print FILENAME, FNR}' myUniqueFile xa?
xaa 3
xab 2
xac 4
xad 5
xae 5
xaf 8
xag 9

Rinse and repeat for the other method/sort order.
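As a rough sketch of what "rinse and repeat" could look like, the pipeline above can be wrapped in a function that takes the sort column as a parameter. The function name split_and_sort, the SORT_COL environment variable and the output prefix argument are hypothetical names introduced here for illustration; the sketch still assumes GNU split, bash, and the myUniqueFile and myDuplicationFile names from the question.

```shell
#!/bin/bash
# split_and_sort COLUMN PREFIX   (hypothetical helper, not from the original answer)
# For each data line of myUniqueFile, writes one file PREFIXaa, PREFIXab, ...
# holding the header plus myDuplicationFile and that line, sorted
# numerically descending on COLUMN.
split_and_sort() {
    local column=$1 prefix=$2
    tail -n +2 myUniqueFile |
        SORT_COL=$column SHELL=$(command -v bash) split -l1 --filter='{
            head -n 1 myDuplicationFile &&
            sort -g -r -k"$SORT_COL","$SORT_COL" <(tail -n +2 myDuplicationFile) -
        } > "$FILE"' - "$prefix"
}

# Example calls (run in the directory that holds both input files):
#   split_and_sort 4 m1_    # Method 1, phylop column
#   split_and_sort 5 m2_    # Method 2, GPS column
```

The awk one-liner shown earlier can then be pointed at m1_?? or m2_?? instead of xa?.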

Question 2

This script computes the ranks without creating temporary files (well, almost: it does create one, sorted_file). It also sorts myDuplicationFile only once per method and then reuses the sorted result.

#!/bin/bash

rank_determination() {
    # Sorts the "myDuplicationFile" one time
    # The "sorted_file" will be used further.
    ###
    tail -n +2 myDuplicationFile | sort -g -r -k "$1","$1" > sorted_file

    # gawk iterates through "myUniqueFile" line by line (except the first line).
    gawk -v field_number="$1" '
    NR != 1 {
        # Stores the needed value for the each line
        ###
        search_value=$field_number
        cnt=1

        # then, it checks the specified column in the "sorted_file"
        # line by line for the value, which is less than 
        # the "search_value" from the "myUniqueFile".
        ###
        while((getline < "sorted_file") > 0) {
            if($field_number < search_value)
                break
            cnt++
        }

        print cnt
        # closing is needed for reading the file from the beginning
        # each time. Else, "getline" will read line by line consistently.
        ###
        close("sorted_file")
    }' myUniqueFile
}

# rank_determination takes one argument, the column number to rank by:
# "4" for the "phylop" column, "5" for the "GPS" column.
#
# It writes the ranks to standard output, which you can redirect
# to whatever file you need.
# Call it once for each column you want ranked.
rank_determination 4 > method1.txt
rank_determination 5 > method2.txt

Output

tail -n +1 -- method*
==> method1.txt <==
2
1
3
4
4
7
8

==> method2.txt <==
2
2
3
5
6
7
8

Question 3

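I agree with what @WeijunZhou said in his comment: there is no need to create all of these temporary files to do this.

The following Perl script computes the counts for both the Method 1 (phylop) and Method 2 (GPS) sorts in a single pass through the two files.

It works by keeping sorted lists (arrays) of the phylop and GPS values from the duplicate file and then, for each line of the unique file, calculating the position in each sorted array where that line's phylop and GPS values would have been sorted.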

#!/usr/bin/perl

use strict;

# get uniqfile and dupefile names from cmd line, with defaults
my $uniqfile = shift || 'myUniqueFile';
my $dupefile = shift || 'myDuplicationFile';

# Read in the dupefile and keep the phylops and GPS values.
# This could take a LOT of memory if dupefile is huge.
# Most modern systems should have no difficulty coping with even
# a multi-gigabyte dupefile.
my @phylop=();
my @GPS=();

open(DUPE,"<",$dupefile) || die "couldn't open '$dupefile': $!\n";
while(<DUPE>) {
  chomp;
  next if (m/^chromosoom/);

  my($chr,$start,$end,$phylop,$GPS) = split;
  push @phylop, $phylop + 0; # add 0 to make sure we only ever store a number
  push @GPS, $GPS + 0;
};
close(DUPE);

# Sort the @phylop and @GPS arrays, numerically descending
@phylop = sort {$b <=> $a} @phylop;
@GPS = sort {$b <=> $a} @GPS;

print "Method1\tMethod2\n";

# Now find out where the phylop and GPS value from each line of uniqfile
# would have ended up if we had sorted it into dupefile
open(UNIQ,"<",$uniqfile) || die "couldn't open '$uniqfile': $!\n";
while (<UNIQ>) {
  next if (m/^chromosoom/);
  chomp;

  my $phylop_sort_line=1;
  my $GPS_sort_line=1;

  my($chr,$start,$end,$phylop,$GPS) = split;

  for my $i (0..@phylop-1) {
    $phylop_sort_line++ if ($phylop < $phylop[$i]);
    $GPS_sort_line++ if ($GPS < $GPS[$i]);
  };

  #printf "%i\t%i\t#%s\n", $phylop_sort_line, $GPS_sort_line, $_;
  printf "%i\t%i\n", $phylop_sort_line, $GPS_sort_line;  
};
close(UNIQ);

When run against the sample data you provided above, the output is:

$ ./counts-for-methods.pl
Method1 Method2
2       1
1       1
3       2
4       3
4       5
7       7
8       7

The script completely ignores the header line in both files, so these line numbers may be off by one if your current algorithm counts the header.

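It also assumes that a value from the unique file always sorts immediately before any identical values in the duplicate file. If that is not what you want, change the < comparisons in the for my $i (0..@phylop-1) loop to <=.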
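To make the effect of < versus <= concrete, here is a minimal shell/awk sketch of the same counting idea for a single value. It is not part of the Perl script: the function name rank_of and its "strict"/"ties" mode argument are hypothetical, and it assumes a whitespace-separated file with one header line.

```shell
#!/bin/bash
# rank_of VALUE FILE COLUMN [strict|ties]   (hypothetical helper)
# Prints the 1-based position VALUE would take if it were sorted,
# numerically descending, into COLUMN of FILE. The header line is skipped.
#   strict: count only larger values, so VALUE lands before equal ones (<)
#   ties:   count equal values too, so VALUE lands after equal ones (<=)
rank_of() {
    local value=$1 file=$2 column=$3 mode=${4:-strict}
    awk -v v="$value" -v c="$column" -v m="$mode" '
        NR == 1 { next }                          # skip the header line
        m == "strict" && v + 0 <  $c + 0 { n++ }
        m == "ties"   && v + 0 <= $c + 0 { n++ }
        END { print n + 1 }
    ' "$file"
}

# Example call:
#   rank_of 0.7 myDuplicationFile 4 ties
```

With column values 0.9, 0.7, 0.7 and 0.2, a lookup of 0.7 gives position 2 in strict mode and position 4 in ties mode.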

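If you need the values for Method 1 and Method 2 separately, they are easy to extract with awk. Alternatively, the perl script could easily be modified to open two output files, one per method, and print the respective values to each file.

Here is a version that handles input lines with 151 fields. I don't have an input file like that, so I tested it using the "5 fields version" that is commented out in the code. The output is identical to that of the version above.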
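One possible way to do the awk extraction is sketched below. The function name split_methods is hypothetical; it assumes the two-column, tab-separated output with a single "Method1 Method2" header line shown above, read from standard input.

```shell
#!/bin/bash
# split_methods OUT1 OUT2   (hypothetical helper)
# Reads the script's two-column output on stdin, drops the header line,
# and writes column 1 to OUT1 and column 2 to OUT2.
split_methods() {
    awk -v o1="$1" -v o2="$2" '
        NR == 1 { next }                  # skip the "Method1 Method2" header
        { print $1 > o1; print $2 > o2 }
    '
}

# Example call:
#   ./counts-for-methods.pl | split_methods method1.txt method2.txt
```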

#!/usr/bin/perl

use strict;

# get uniqfile and dupefile names from cmd line, with defaults
my $uniqfile = shift || 'myUniqueFile';
my $dupefile = shift || 'myDuplicationFile';

my @phylop=();
my @GPS=();

# Read in the dupefile and keep the phylops and GPS values.
# This could take a LOT of memory if dupefile is huge.
# Most modern systems should have no difficulty coping with even
# a multi-gigabyte dupefile.
open(DUPE,"<",$dupefile) || die "couldn't open '$dupefile': $!\n";
while(<DUPE>) {
  chomp;
  next if (m/^chromosoom/);

  my @fields = split;

# 151 fields version:
  push @phylop, $fields[42]+0;
  push @GPS, $fields[150]+0;

# 5 fields version:
#  push @phylop, $fields[3]+0;
#  push @GPS, $fields[4]+0;

};
close(DUPE);

# Sort the @phylop and @GPS arrays, numerically descending
@phylop = sort {$b <=> $a} @phylop;
@GPS = sort {$b <=> $a} @GPS;

print "Method1\tMethod2\n";

# Now find out where the phylop and GPS from each line of uniqfile
# would have ended up if we had sorted it into the dupefile
open(UNIQ,"<",$uniqfile) || die "couldn't open '$uniqfile': $!\n";
while (<UNIQ>) {
  next if (m/^chromosoom/);
  chomp;

  my $phylop_sort_line=1;
  my $GPS_sort_line=1;

  my @fields = split;

  for my $i (0..@phylop-1) {

# 151 fields version:
    $phylop_sort_line++ if ($fields[42] < $phylop[$i]);
    $GPS_sort_line++ if ($fields[150] < $GPS[$i]);

# 5 fields version:
#    $phylop_sort_line++ if ($fields[3] < $phylop[$i]);
#    $GPS_sort_line++ if ($fields[4] < $GPS[$i]);
  };

  #printf "%i\t%i\t#%s\n", $phylop_sort_line, $GPS_sort_line, $_;
  printf "%i\t%i\n", $phylop_sort_line, $GPS_sort_line;

};
close(UNIQ);

Answer