Create a table containing the frequencies of unique names found across multiple .csv files

Question

Answer 1

The following works on Python 2 and 3. Save it as xyz.py and run it with:
python xyz.py file_1 file_2 file_3

import sys
import csv

names = set()  # to keep track of all sequence names

files = {}  # map of file_name to dict of sequence_names mapped to counts
# counting
for file_name in sys.argv[1:]:
    # look up the dict for this file, creating an empty one if it is new
    b = files.setdefault(file_name, {})
    with open(file_name) as fp:
        for line in fp:
            x = line.strip().split()  # split the line on whitespace
            if len(x) < 2:
                continue  # skip blank lines and lines without a second field
            names.add(x[1])  # might be a new sequence name
            # Fetch the counter for this sequence name, creating it if needed.
            # A plain int would not work here: "i += 1" on the returned value
            # would have to be assigned back to b[x[1]]. The list "[0]" is
            # stored by reference, so updating it in place is enough.
            b.setdefault(x[1], [0])[0] += 1

# output
names = sorted(names)  # sort the unique sequence names for the columns
grid = []
# create top line
top_line = ['taxa']
grid.append(top_line)
for name in names:
    top_line.append(name)
# append each file's values to the grid
for file_name in sys.argv[1:]:
    data = files[file_name]
    line = [file_name]
    grid.append(line)
    for name in names:
        line.append(data.get(name, [0])[0])  # 0 if sequence name not in file
# dump the grid to CSV
with open('out.csv', 'w') as fp:
    writer = csv.writer(fp)
    writer.writerows(grid)
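
The comment about "i += 1" in the counting loop can be checked in isolation. The following is a minimal sketch, not part of the script; the key "seq_a" and the variable names are made up for illustration.

```python
# Variant 1: plain integers. setdefault returns the stored int, and ints
# are immutable, so incrementing the local name leaves the dict unchanged.
counts = {}
i = counts.setdefault("seq_a", 0)
i += 1
plain_result = counts["seq_a"]  # the dict still holds 0

# Variant 2: one-element lists. setdefault returns the very list object
# stored in the dict, so updating its first slot is visible via the dict.
boxed = {}
boxed.setdefault("seq_a", [0])[0] += 1
boxed.setdefault("seq_a", [0])[0] += 1
boxed_result = boxed["seq_a"][0]  # the dict now holds [2]

print(plain_result, boxed_result)
```

With plain integers the loop would need an explicit write-back such as b[x[1]] = b.get(x[1], 0) + 1, which is what the one-element list avoids.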

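Using a one-element list "[0]" as the counter makes updating the value easier than using a plain integer directly. If the input files are more complex, it is better to read them with Python's csv library.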
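
As a rough sketch of that suggestion, the counting step could be built on csv.reader, here combined with collections.Counter in place of the "[0]" lists. This is an alternative, not the script above: the function names count_names and build_grid are hypothetical, the comma delimiter and the column index 1 are assumptions about the input (the script above splits on whitespace instead), and the sketch targets Python 3 only because of the newline argument to open.

```python
import csv
from collections import Counter


def count_names(path, column=1, delimiter=","):
    """Count how often each value occurs in one column of a delimited file."""
    counts = Counter()
    # newline="" is what the csv module recommends for files it reads
    with open(path, newline="") as fp:
        for row in csv.reader(fp, delimiter=delimiter):
            if len(row) > column:  # skip blank or short rows
                counts[row[column]] += 1
    return counts


def build_grid(paths, **kwargs):
    """Return a header row followed by one row of counts per input file."""
    per_file = [(path, count_names(path, **kwargs)) for path in paths]
    # union of all names seen in any file, sorted for stable column order
    names = sorted(set().union(*(counts for _, counts in per_file)))
    grid = [["taxa"] + names]
    for path, counts in per_file:
        # a Counter returns 0 for names that never occurred in this file
        grid.append([path] + [counts[name] for name in names])
    return grid
```

Passing build_grid(sys.argv[1:]) to csv.writer(fp).writerows(...) would then produce the same kind of table as the output step of the script. For tab-separated input, delimiter="\t" can be passed through build_grid.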
