Python에서 텍스트 블록을 구문 분석하는 효율적인 방법이 있습니까?

2024-5-27 • tag-icon

나는거대한다음과 같은 줄이 포함된 파일(~70GB):

$ cat mybigfile.txt
5 7  
    1    1    0   -2    0    0    2
    0    4    0   -4    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -1
    0    0    0    0    0    1   -1
5 8  
   -1   -1   -1   -1   -1    1    1    1
    0    0    2    0    0    0   -1   -1
    3    3    3   -1   -1   -1   -1   -1
   -1   -1   -1    0    2    0    0    0
   -1    1   -1    0    0    0    1    0
5 7  
    1    1    0   -2    0    0    5
    0    2    0   -2    0    0    2
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -4
    0    0    0    0    0    1   -4
5 7  
    1    1    0   -2    0    1   -1
    0    2    0   -2    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -2
    0    0    0    0    0    2   -4

파일 헤더의 마지막 문자를 기준으로 각 청크를 구성하여 큰 파일을 여러 개의 작은 파일로 분할하고 싶습니다. 따라서 실행하면 $ python magic.py mybigfile.txt두 개의 새 파일이 생성 v07.txt되고v08.txt

$ cat v07.txt
5 7  
    1    1    0   -2    0    0    2
    0    4    0   -4    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -1
    0    0    0    0    0    1   -1
5 7  
    1    1    0   -2    0    0    5
    0    2    0   -2    0    0    2
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -4
    0    0    0    0    0    1   -4
5 7  
    1    1    0   -2    0    1   -1
    0    2    0   -2    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -2
    0    0    0    0    0    2   -4

$ cat v08.txt
5 8  
   -1   -1   -1   -1   -1    1    1    1
    0    0    2    0    0    0   -1   -1
    3    3    3   -1   -1   -1   -1   -1
   -1   -1   -1    0    2    0    0    0
   -1    1   -1    0    0    0    1    0

각 블록의 헤더는 ~ 5 i의 형식 입니다 .ii=6i=22

이것이 가능한가? 제가 아는 유일한 보급형 언어는 Python이므로 가능하면 Python 솔루션을 선호합니다.

이것이 내 해결책입니다.

from string import whitespace
import sys


class PolyBlock(object):

    def __init__(self, lines):
        self.lines = lines

    def nvertices(self):
        return self.lines[0].split()[-1]

    def outname(self):
        return 'v' + self.nvertices().zfill(2) + '.txt'

    def writelines(self):
        with open(self.outname(), 'a') as f:
            for line in self.lines:
                f.write(line)

    def __repr__(self):
        return ''.join(self.lines)


def genblocks():
    with open('5d.txt', 'r') as f:
        block = [next(f)]
        for line in f:
            if line[0] in whitespace:
                block.append(line)
            else:
                yield PolyBlock(block)
                block = [line]


def main():
    for block in genblocks():
        block.writelines()
        sys.stdout.write(block.__repr__())


if __name__ == '__main__':
    main()

내 솔루션은 각 청크를 반복하고 출력 파일을 반복적으로 열고 닫습니다. 이것이 더 효율적일 것이라고 생각하지만 내 코드를 개선하는 방법을 잘 모르겠습니다.

답변1

awk 명령에 문제가 없다면 다음을 시도해 보십시오...

awk 'NF==2{filename="v0"$2".txt"}{print > filename}' mybigfile.txt

답변1

관련 정보