일부 로그 정보가 포함된 두 개의 csv 파일이 있습니다. field1(file1)의 행과 field2(file2)의 행을 1분 범위(-00:00:01 < time < 00:00:01)로 비교하고 field5(file1)를 field3(file2)과 비교해야 합니다. 일치하면 field3(input2)을 출력에 복사하고 file1의 다음 필드를 인쇄하고, 그렇지 않으면 "NOACT"를 씁니다. awk
시도 했지만 성공하지 join
못했습니다 python
.
파일 1(55,000,000줄):
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1
파일 2(25000줄):
19:15:30,187.173.121.63,42347,NOT
19:15:30,187.173.121.63,52364,OK
19:15:30,187.173.121.63,52364,OK
19:15:32,145.246.158,44444,NOT
예상 출력:
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT
예
파일 1:
A11 A12 A13 A14 A15 A16 A17 A18 A19 A110
A21 A22 A23 A24 A25 A26 A27 A28 A29 A210
A31 A32 A33 A34 A35 A36 A37 A38 A39 A310
파일 2:
B11 B12 B13
B21 B22 B23
B31 B32 B33
B11이 A11과 일치하는지 확인하고, 그렇다면 B12가 A15와 일치하는지 확인해야 합니다. 그렇다면 출력의 첫 번째 줄(outline1= inputAline1 && B13)을 쓰고, 그렇지 않으면 B의 다음 줄로 이동합니다. 일치하는 항목이 없으면 A && "NOACT"의 첫 번째 줄이 기록됩니다.
답변1
다음을 사용하여 sqlite
데이터를 조인할 수 있습니다.
$ sqlite3 <<EOT
.mode csv
CREATE TABLE file1 (A11,A12,A13,A14,A15,A16,A17,A18,A19,A110,A111);
CREATE TABLE file2 (B11,B12,B13,B14);
.import file1 file1
.import file2 file2
SELECT DISTINCT file1.*, ifnull(file2.B14,"NOACT") FROM file1 LEFT JOIN file2 ON abs(julianday(file1.A11) - julianday(file2.B11))*86400.0 < 1 AND A15 = B13;
EOT
이것은 만든다:
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT
답변2
다음 스크립트는 Python에서 찾고 있는 작업을 수행해야 합니다.
#!/usr/bin/env python3
import sys
import csv
import bisect
import re
timeRange = 1 # seconds
# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')
# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
# We add hours and minutes in case they are missinge
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result
if (len(sys.argv) != 4):
print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)
inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]
# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2) as inputFile2:
csvReader = csv.reader(inputFile2)
for row in csvReader:
if len(row) == 4:
if not timestampRegex.match(row[0]):
continue
time = timeToFloat(row[0])
file2Entries.append((time, row))
file1Entries = []
with open(inputFileName1) as inputFile1, open(outputFileName, 'w') as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
if len(row) == 11:
if not timestampRegex.match(row[0]):
# We can't possibly find a match for this line
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
continue
time = timeToFloat(row[0])
# Find the first and last entries of file2 within a range of 1 second
a = bisect.bisect_right(file2Entries, (time - timeRange,))
b = bisect.bisect_left(file2Entries, (time + timeRange,))
# Loop on the possible matches (those within the given range)
for entry in file2Entries[a:b]:
if entry[1][2] == row[4]:
outputFile.write(','.join(row + [entry[1][3]]))
outputFile.write('\n')
break;
else:
# We haven't found a match in file2
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
다음과 같이 저장 compare.py
하고 실행할 수 있습니다.
./compare.py input1.txt input2.txt output.txt
추출된 범위가 올바른지 완전히 확신할 수 없으므로 제대로 작동하는지 확인해야 합니다.
고쳐 쓰다: 유효한 타임스탬프가 포함되지 않은 두 번째 입력 파일의 줄은 이제 무시되는 반면, 유효한 타임스탬프가 포함되지 않은 첫 번째 입력 파일의 줄은 이제 끝에 NOACT가 추가된 출력 파일에 복사됩니다.
답변3
새로 편집된 스크립트는 문제가 없지만 오류인 새 줄에서 멈춥니다.
root@localhost:~/python# ./compare.py input1.txt input2.txt output.txt Traceback (most recent call last): File "./compare.py", line 46, in <module> for row in csvReader: File "/usr/lib/python3.2/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 2099: invalid start byte``
input1에서 다음 줄을 찾았는데 다른 줄과 비슷해서 문제를 해결하려고 노력한 결과 다음과 같은 내용을 발견했습니다.
스크립트를 다음으로 변경했습니다.
#!/usr/bin/env python3
import sys
import csv
import bisect
import re
timeRange = 1 # seconds
# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')
# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
# We add hours and minutes in case they are missinge
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result
if (len(sys.argv) != 4):
print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)
inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]
# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2, encoding="utf8") as inputFile2:
csvReader = csv.reader(inputFile2)
for row in csvReader:
if len(row) == 4:
if not timestampRegex.match(row[0]):
continue
time = timeToFloat(row[0])
file2Entries.append((time, row))
file1Entries = []
with open(inputFileName1, encoding="utf8") as inputFile1, open(outputFileName, 'w', encoding="utf8") as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
if len(row) == 11:
if not timestampRegex.match(row[0]):
# We can't possibly find a match for this line
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
continue
time = timeToFloat(row[0])
# Find the first and last entries of file2 within a range of 1 second
a = bisect.bisect_right(file2Entries, (time - timeRange,))
b = bisect.bisect_left(file2Entries, (time + timeRange,))
# Loop on the possible matches (those within the given range)
for entry in file2Entries[a:b]:
if entry[1][2] == row[4]:
outputFile.write(','.join(row + [entry[1][3]]))
outputFile.write('\n')
break;
else:
# We haven't found a match in file2
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
하지만 여전히 오류가 발생합니다