Extracting subsequences of integers that start with a repeated number

Answer 1

Here is a Python script that does what you want:

#!/usr/bin/env python3
# -*- coding: ascii -*-
"""extract_subsequences.py"""

import sys
import re

# Open the file
with open(sys.argv[1]) as file_handle:

    # Read the data from the file
    # Remove white-space and ignore non-integers
    numbers = [
        line.strip()
        for line in file_handle.readlines()
        if re.match(r"^\d+$", line)
    ]

    # Set a lower bound so that we can output multiple lists
    lower_bound = 0
    while lower_bound < len(numbers)-1:

        # Find the "start index" where the same number
        # occurs twice at consecutive locations
        start_index = -1 
        for i in range(lower_bound, len(numbers)-1):
            if numbers[i] == numbers[i+1]:
                start_index = i
                break

        # If a "start index" is found, print the two repeated
        # values and the next 10 values as well
        if start_index >= lower_bound:
            upper_bound = min(start_index+12, len(numbers))
            print(' '.join(numbers[start_index:upper_bound]))

            # Update the lower bound
            lower_bound = start_index + 1

        # If no "start index" is found then we're done
        else:
            break

Assuming your data is in a file called data.txt, you can run the script like this:

python extract_subsequences.py data.txt

For the sample input file data.txt (its contents are not reproduced here), the output is:

1 1 1 2 3 4 5 6 7 8 9 10
1 1 2 3 4 5 6 7 8 9 10 11

To save the output to a file, use output redirection:

python extract_subsequences.py data.txt > output.txt
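
As a minimal sketch of the same search, the loop can be written as a function that returns the slices instead of printing them. The function name `extract_subsequences`, the `length` parameter, and the sample input are assumptions made for illustration; the sample was chosen so that the result is consistent with the two output lines shown above, and it is not the original input file.

```python
def extract_subsequences(numbers, length=12):
    """Return each slice of up to `length` items that starts where a value repeats."""
    result = []
    for i in range(len(numbers) - 1):
        # A slice starts wherever the same value occurs at two consecutive positions.
        if numbers[i] == numbers[i + 1]:
            result.append(numbers[i:i + length])
    return result


# Assumed sample input (hypothetical): three 1s followed by 2..11.
sample = ["1", "1", "1"] + [str(n) for n in range(2, 12)]
for row in extract_subsequences(sample):
    print(" ".join(row))
```

Because every index is checked, overlapping slices are all reported, which is what the `lower_bound = start_index + 1` update in the script achieves.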

Answer 2

AWK approach:

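Only the first pair of identical consecutive numbers encountered is considered, so the approach is suitable for multiple extractions. However, it does not handle the case where another pair of identical consecutive numbers occurs among the 10 numbers that follow within the slice being processed.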

awk 'NR==n && $1==v{ print v ORS $1 > ("file" ++c); tail=n+11; next }
     { v=$1; n=NR+1 }NR<tail{ print > ("file" c) }' file
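
To make the limitation concrete, here is a rough Python port of the one-liner's state machine, as far as it can be traced by hand. It collects each slice in a list instead of writing `file1`, `file2`, and so on. The function name `split_on_repeats` and both sample inputs are assumptions for illustration only.

```python
def split_on_repeats(lines, tail_len=10):
    """Mimic the awk one-liner: one group per detected pair, no overlapping groups."""
    groups = []
    prev = None      # plays the role of awk's v
    expect = None    # plays the role of awk's n (line number where a repeat is checked)
    tail = 0         # lines numbered below this value are appended to the current group
    for nr, value in enumerate(lines, start=1):
        if nr == expect and value == prev:
            groups.append([prev, value])
            tail = expect + tail_len + 1
            continue  # like awk's `next`: prev and expect are left unchanged
        prev = value
        expect = nr + 1
        if nr < tail:
            groups[-1].append(value)
    return groups


# Assumed input: a third consecutive 1 does not open a second group.
print(split_on_repeats(["1", "1", "1"] + [str(n) for n in range(2, 12)]))

# Assumed input: a new pair inside the tail cuts the first group short.
print(split_on_repeats(["5", "5", "6", "7", "7", "8"]))
```

In the second call the first group stops growing as soon as the `7 7` pair opens a new one, so it ends up shorter than 12 items.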

Answer 3

First variant - O(n)

awk '
/^[0-9]+$/{
    arr[cnt++] = $0;
}

END {
    for(i = 1; i < cnt; i++) {
        if(arr[i] != arr[i - 1])
            continue;

        last_element = i + 11; 
        for(j = i - 1; j < cnt && j < last_element; j++) {
            printf "%s ", arr[j];
        }
        print "";
    }
}' input.txt

Second variant - O(n * n)

awk '
BEGIN {
    cnt = 0;
}

/^[0-9]+$/{
    if(prev == $0) {
        arr[cnt] = prev;
        cnt_arr[cnt]++;
        cnt++;
    }
    
    for(i = 0; i < cnt; i++) {
        if(cnt_arr[i] < 12) {
            arr[i] = arr[i] " " $0; 
            cnt_arr[i]++;
        }
    }

    prev = $0;        
}

END {
    for(i = 0; i < cnt; i++)
        print arr[i];
}' input.txt

Output

1 1 1 2 3 4 4 5 6 7 8 9
1 1 2 3 4 4 5 6 7 8 9 10
4 4 5 6 7 8 9 10 11 12 13 14
15 15 16
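
For comparison, the streaming idea behind the second variant can be sketched in Python. This is not a line-by-line port: it keeps a list of only the windows that are still growing, so each input line touches at most `length` windows rather than every window opened so far. The function name `stream_subsequences` and the sample input are assumptions; the input was reconstructed so that it reproduces the output listed above and is not taken from the original post.

```python
import re


def stream_subsequences(lines, length=12):
    """Build the slices in one pass, without storing the whole input first."""
    windows = []      # every window, in the order it was opened
    still_open = []   # windows that have fewer than `length` items
    prev = None
    for line in lines:
        value = line.rstrip("\n")
        if not re.fullmatch(r"[0-9]+", value):
            continue  # skip non-integer lines, like the /^[0-9]+$/ pattern
        if value == prev:
            # A repeat opens a new window that starts with the previous value.
            window = [prev]
            windows.append(window)
            still_open.append(window)
        for window in still_open:
            window.append(value)
        still_open = [w for w in still_open if len(w) < length]
        prev = value
    return windows


# Assumed sample input (hypothetical reconstruction).
sample = "1 1 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 15 16".split()
for window in stream_subsequences(sample):
    print(" ".join(window))
```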
