두 파일의 4개 열 비교

Question 1

다음을 사용하여 sqlite데이터를 조인할 수 있습니다.

$ sqlite3 <<EOT
.mode csv
CREATE TABLE file1 (A11,A12,A13,A14,A15,A16,A17,A18,A19,A110,A111);
CREATE TABLE file2 (B11,B12,B13,B14);
.import file1 file1
.import file2 file2
SELECT DISTINCT file1.*, ifnull(file2.B14,"NOACT") FROM file1 LEFT JOIN file2 ON abs(julianday(file1.A11) - julianday(file2.B11))*86400.0 < 1 AND A15 = B13;
EOT

이것은 만든다:

19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT

Answer

다음을 사용하여 sqlite데이터를 조인할 수 있습니다.

$ sqlite3 <<EOT
.mode csv
CREATE TABLE file1 (A11,A12,A13,A14,A15,A16,A17,A18,A19,A110,A111);
CREATE TABLE file2 (B11,B12,B13,B14);
.import file1 file1
.import file2 file2
SELECT DISTINCT file1.*, ifnull(file2.B14,"NOACT") FROM file1 LEFT JOIN file2 ON abs(julianday(file1.A11) - julianday(file2.B11))*86400.0 < 1 AND A15 = B13;
EOT

이것은 만든다:

19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT

Question 2

다음 스크립트는 Python에서 찾고 있는 작업을 수행해야 합니다.

#!/usr/bin/env python3

import sys
import csv
import bisect
import re

timeRange = 1 # seconds

# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')

# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
    # We add hours and minutes in case they are missinge
    # We reverse the list
    # We only take the first three elements of the list anyway
    (s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
    result = float(h) * 3600 + float(m) * 60 + float(s)
    return result

if (len(sys.argv) != 4):
    print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
    exit(1)

inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]

# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2) as inputFile2:
    csvReader = csv.reader(inputFile2)
    for row in csvReader:
        if len(row) == 4:
            if not timestampRegex.match(row[0]):
                continue
            time = timeToFloat(row[0])
            file2Entries.append((time, row))

file1Entries = []
with open(inputFileName1) as inputFile1, open(outputFileName, 'w') as outputFile:
    csvReader = csv.reader(inputFile1)
    # For each row in file1 we look for a match among the entries of file2
    for row in csvReader:
        if len(row) == 11:
            if not timestampRegex.match(row[0]):
                # We can't possibly find a match for this line
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')
                continue
            time = timeToFloat(row[0])
            # Find the first and last entries of file2 within a range of 1 second
            a = bisect.bisect_right(file2Entries, (time - timeRange,))
            b = bisect.bisect_left(file2Entries, (time + timeRange,))
            # Loop on the possible matches (those within the given range)
            for entry in file2Entries[a:b]:
                if entry[1][2] == row[4]:
                    outputFile.write(','.join(row + [entry[1][3]]))
                    outputFile.write('\n')
                    break;
            else:
                # We haven't found a match in file2
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')

다음과 같이 저장 compare.py하고 실행할 수 있습니다.

./compare.py input1.txt input2.txt output.txt

추출된 범위가 올바른지 완전히 확신할 수 없으므로 제대로 작동하는지 확인해야 합니다.

고쳐 쓰다: 유효한 타임스탬프가 포함되지 않은 두 번째 입력 파일의 줄은 이제 무시되는 반면, 유효한 타임스탬프가 포함되지 않은 첫 번째 입력 파일의 줄은 이제 끝에 NOACT가 추가된 출력 파일에 복사됩니다.

Answer

다음 스크립트는 Python에서 찾고 있는 작업을 수행해야 합니다.

#!/usr/bin/env python3

import sys
import csv
import bisect
import re

timeRange = 1 # seconds

# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')

# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
    # We add hours and minutes in case they are missinge
    # We reverse the list
    # We only take the first three elements of the list anyway
    (s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
    result = float(h) * 3600 + float(m) * 60 + float(s)
    return result

if (len(sys.argv) != 4):
    print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
    exit(1)

inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]

# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2) as inputFile2:
    csvReader = csv.reader(inputFile2)
    for row in csvReader:
        if len(row) == 4:
            if not timestampRegex.match(row[0]):
                continue
            time = timeToFloat(row[0])
            file2Entries.append((time, row))

file1Entries = []
with open(inputFileName1) as inputFile1, open(outputFileName, 'w') as outputFile:
    csvReader = csv.reader(inputFile1)
    # For each row in file1 we look for a match among the entries of file2
    for row in csvReader:
        if len(row) == 11:
            if not timestampRegex.match(row[0]):
                # We can't possibly find a match for this line
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')
                continue
            time = timeToFloat(row[0])
            # Find the first and last entries of file2 within a range of 1 second
            a = bisect.bisect_right(file2Entries, (time - timeRange,))
            b = bisect.bisect_left(file2Entries, (time + timeRange,))
            # Loop on the possible matches (those within the given range)
            for entry in file2Entries[a:b]:
                if entry[1][2] == row[4]:
                    outputFile.write(','.join(row + [entry[1][3]]))
                    outputFile.write('\n')
                    break;
            else:
                # We haven't found a match in file2
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')

다음과 같이 저장 compare.py하고 실행할 수 있습니다.

./compare.py input1.txt input2.txt output.txt

추출된 범위가 올바른지 완전히 확신할 수 없으므로 제대로 작동하는지 확인해야 합니다.

고쳐 쓰다: 유효한 타임스탬프가 포함되지 않은 두 번째 입력 파일의 줄은 이제 무시되는 반면, 유효한 타임스탬프가 포함되지 않은 첫 번째 입력 파일의 줄은 이제 끝에 NOACT가 추가된 출력 파일에 복사됩니다.

Question 3

새로 편집된 스크립트는 문제가 없지만 오류인 새 줄에서 멈춥니다.

root@localhost:~/python# ./compare.py input1.txt input2.txt output.txt Traceback (most recent call last): File "./compare.py", line 46, in <module> for row in csvReader: File "/usr/lib/python3.2/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 2099: invalid start byte``

input1에서 다음 줄을 찾았는데 다른 줄과 비슷해서 문제를 해결하려고 노력한 결과 다음과 같은 내용을 발견했습니다.

https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character

스크립트를 다음으로 변경했습니다.

#!/usr/bin/env python3

    import sys
    import csv
    import bisect
    import re

timeRange = 1 # seconds

# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')

# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
# We add hours and minutes in case they are missinge
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result

if (len(sys.argv) != 4):
    print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)

inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]

# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2, encoding="utf8") as inputFile2:
    csvReader = csv.reader(inputFile2)
    for row in csvReader:

        if len(row) == 4:
            if not timestampRegex.match(row[0]):
                continue
            time = timeToFloat(row[0])
            file2Entries.append((time, row))

file1Entries = []
with open(inputFileName1, encoding="utf8") as inputFile1, open(outputFileName, 'w', encoding="utf8") as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
        if len(row) == 11:
            if not timestampRegex.match(row[0]):
                # We can't possibly find a match for this line
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')
                continue
            time = timeToFloat(row[0])
            # Find the first and last entries of file2 within a range of 1 second
            a = bisect.bisect_right(file2Entries, (time - timeRange,))
            b = bisect.bisect_left(file2Entries, (time + timeRange,))
            # Loop on the possible matches (those within the given range)
            for entry in file2Entries[a:b]:
                if entry[1][2] == row[4]:
                    outputFile.write(','.join(row + [entry[1][3]]))
                    outputFile.write('\n')
                    break;
            else:
                # We haven't found a match in file2
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')

하지만 여전히 오류가 발생합니다

Answer