Querying CSV files like SQL

Question 1

Several tools have been built to do this. Here is an example:

$ csvq 'select * from cities'
+------------+-------------+----------+
|    name    |  population |  country |
+------------+-------------+----------+
| warsaw     |  1700000    |  poland  |
| ciechanowo |  46000      |  poland  |
| berlin     |  3500000    |  germany |
+------------+-------------+----------+

$ csvq 'insert into cities values("dallas", 1, "america")'
1 record inserted on "C:\\cities.csv".
Commit: file "C:\\cities.csv" is updated.

https://github.com/mithrandie/csvq
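
If installing a dedicated tool is not an option, a similar effect can be had with Python's standard library. The following is only a sketch, not how csvq works: it loads CSV text into an in-memory SQLite table and queries it. The inline data stands in for cities.csv, and the helper name `load_csv_into_sqlite` is made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for cities.csv, mirroring the table shown above.
CITIES_CSV = """name,population,country
warsaw,1700000,poland
ciechanowo,46000,poland
berlin,3500000,germany
"""


def load_csv_into_sqlite(conn, table, text):
    """Create `table` from CSV text; every value is stored as text."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(conn, "cities", CITIES_CSV)

# CAST is needed because the values were loaded as text.
query = """
    SELECT name FROM cities
    WHERE country = 'poland'
    ORDER BY CAST(population AS INTEGER) DESC
"""
polish_cities = [name for (name,) in conn.execute(query)]
print(polish_cities)
```

Unlike csvq, this sketch never writes back to the CSV file; an INSERT here only changes the in-memory table.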

Question 2

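You mentioned that this is an interview question. If I were asked this in an interview, I would ask about the constraints: why they exist, what is and is not allowed, and so on. With each question I would try to tie the constraint back to the business context, so that I actually understand what is going on.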

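I would also want to ask where the animal speed formula comes from, but only because my background is stronger in the physical sciences than in the life sciences and I am curious.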

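As an interviewer, I would very much want to hear that standard tools for parsing CSV exist, such as pandas and csv, rather than hear about parsing or modifying the file from scratch with a script or command-line utilities.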
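
To illustrate why that matters, here is a small sketch comparing a hand-rolled split(',') with the standard csv module. The row is invented for the example; its name field contains a comma, so it is quoted.

```python
import csv
import io

# Hypothetical data: the NAME field contains a comma, so it is quoted.
text = 'NAME,STANCE\n"Rex, Tyrannosaurus",bipedal\n'

# Naive parsing splits inside the quoted field and keeps the quote characters.
naive = text.splitlines()[1].split(",")

# The csv module understands the quoting and returns two clean fields.
parsed = list(csv.reader(io.StringIO(text)))[1]

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields three pieces for a two-column row, which is the kind of bug a standard parser avoids.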

Stack Exchange is not well suited to that kind of back-and-forth Q&A, so I will simply post an answer in Python. In an interview, I would give this answer only after truly understanding the business problem.

# Assume it's OK to import sqrt, otherwise the spirit of the problem isn't understood.
from math import sqrt

# Read data into dictionary.
dino_dict = dict()
for filename in ['file1.csv','file2.csv']:
    with open(filename) as f:
        # Read the first line as the CSV headers/labels.
        labels = f.readline().strip().split(',')

        # Read the data lines.
        for line in f.readlines():
            values = line.strip().split(',')
        
            # For each line insert the data in the dict.
            for label, value in zip(labels, values):
                if label == "NAME":
                    dino_name = value
                    if dino_name not in dino_dict:
                        dino_dict[dino_name] = dict() # New dino.
                else:
                    dino_dict[dino_name][label] = value # New attribute.

# Calculate speed and insert into dictionary.
for dino_stats in dino_dict.values():
    try:
        stride_length = float(dino_stats['STRIDE_LENGTH'])
        leg_length = float(dino_stats['LEG_LENGTH'])
    except (KeyError, ValueError):  # Skip dinos with missing or non-numeric data.
        continue
    
    dino_stats["SPEED"] = ((stride_length / leg_length) - 1) * sqrt(leg_length * 9.8)
    
# Make a list of dinos with their speeds.
bipedal_dinos_with_speed = list()
for dino_name, dino_stats in dino_dict.items():
    if dino_stats.get('STANCE') == 'bipedal':
        if 'SPEED' in dino_stats:
            bipedal_dinos_with_speed.append((dino_name, dino_stats['SPEED']))

# Sort the list by speed and print the dino names.
print([dino_name for dino_name, _ in sorted(bipedal_dinos_with_speed, key=lambda x: x[1], reverse=True)])

['Tyrannosaurus Rex', 'Velociraptor', 'Struthiomimus', 'Hadrosaurus']

Question 3

You can use the excellent Miller and run

mlr --csv join -j NAME -f file1.csv \
then put '$speed=($STRIDE_LENGTH/$LEG_LENGTH - 1)*pow(($LEG_LENGTH*9.8),0.5)' \
then sort -nr speed \
then cut -f NAME file2.csv

to get

NAME
Tyrannosaurus Rex
Velociraptor
Euoplocephalus
Stegosaurus
Hadrosaurus
Struthiomimus

Miller is available for almost every operating system and can be used in scripts from Bash (and other scripting languages), just like cut/paste/sed/awk.
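
For readers who want to see what the pipeline does step by step, here is a rough Python equivalent of the same join, put, sort and cut sequence. It is a sketch under assumptions: the column layout of the two files is inferred from the commands in this thread, the names and numbers are invented, and the join is taken to keep only records present in both files.

```python
import csv
import io
from math import sqrt

# Hypothetical file contents; names and numbers are invented for illustration.
FILE1 = """NAME,LEG_LENGTH,DIET
Alpha,2.0,carnivore
Beta,1.0,herbivore
Gamma,4.0,herbivore
"""
FILE2 = """NAME,STRIDE_LENGTH,STANCE
Alpha,6.0,bipedal
Beta,1.5,bipedal
Delta,3.0,quadrupedal
"""


def read_by_name(text):
    """Index the CSV rows by their NAME column."""
    return {row["NAME"]: row for row in csv.DictReader(io.StringIO(text))}


left = read_by_name(FILE1)
right = read_by_name(FILE2)

# join: keep only the names that appear in both files.
joined = [{**left[name], **right[name]} for name in right if name in left]

# put: compute the speed for each joined record.
for row in joined:
    leg = float(row["LEG_LENGTH"])
    stride = float(row["STRIDE_LENGTH"])
    row["speed"] = (stride / leg - 1) * sqrt(leg * 9.8)

# sort -nr speed, then cut -f NAME.
names = [row["NAME"] for row in sorted(joined, key=lambda r: r["speed"], reverse=True)]
print(names)
```

Like the Miller command, this sketch does not filter on STANCE, so quadrupedal animals that appear in both files would be listed too.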

Question 4

A gawk solution using join: the input is sorted first for join, and the final sort is done internally by awk.

join -t, <(sort file1.csv) <(sort file2.csv) | 
    awk -F, -v g=9.8 '/bipedal/{osaur[$1]=($4/$2-1)*sqrt(g*$2)}
        END{PROCINFO["sorted_in"]="@val_num_desc"; for (d in osaur) print d}'

Tyrannosaurus Rex
Velociraptor
Struthiomimus
Hadrosaurus

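Edited in response to @Cbhihe's comment

There is a useful resource on controlling how gawk scans arrays.

PROCINFO["sorted_in"] can be set to control the order in which an array is read.

In this case we sort by value (@val), assume the values are numeric (num), and sort descending (desc), hence @val_num_desc.

You can also output the array by its indices. In that case you would use the indices (@ind), assume they are strings (str), and sort ascending (asc), hence @ind_str_asc.

The combinations of these, and all the available options, are in the linked resource.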
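
For anyone more at home in Python, the two orderings correspond roughly to the following. This is an analogy only; the dictionary is a made-up stand-in for the awk array osaur.

```python
# Hypothetical speeds keyed by name, standing in for the awk array `osaur`.
osaur = {"Beta": 1.5, "Alpha": 8.9, "Gamma": 4.2}

# Rough analogue of "@val_num_desc": keys ordered by numeric value, largest first.
by_value_desc = sorted(osaur, key=osaur.get, reverse=True)

# Rough analogue of "@ind_str_asc": keys ordered as strings, ascending.
by_index_asc = sorted(osaur)

print(by_value_desc)
print(by_index_asc)
```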

Answer