Detecting unique lines in a log file

I have a large log file and want to detect the patterns in it rather than specific lines.

For example:

/path/messages-20181116:11/15/2018 14:23:05.159|worker001|clusterm|I|userx deleted job 5018
/path/messages-20181116:11/15/2018 14:41:25.662|worker001|clusterm|I|userx deleted job 4895
/path/messages-20181116:11/15/2018 14:41:25.673|worker000|clusterm|I|userx deleted job 4890
/path/messages-20181116:11/15/2018 14:41:25.681|worker000|clusterm|I|userx deleted job 4889
11/09/2018 06:18:55.115|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1 
11/09/2018 06:18:55.118|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1                
11/09/2018 06:18:55.120|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1                
11/09/2018 06:18:55.140|scheduler000|clusterm|P|PROF: job dispatching took 5.005 s (10 fast)             
11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 1 job(s)             
11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 5 job(s)             
11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 3 job(s)             
11/09/2018 06:18:55.145|scheduler000|clusterm|P|PROF: parallel matching   14  0438 107668                 
11/09/2018 06:18:55.148|scheduler000|clusterm|P|PROF: sequential matching  9  0261   8203               
11/09/2018 06:18:55.561|scheduler000|clusterm|P|PROF(1776285440): job sorting :wc =0.006s              
11/09/2018 06:18:55.564|scheduler000|clusterm|P|PROF(1776285440): job dispatching: wc=5.005              
11/09/2018 06:18:55.561|scheduler000|clusterm|P|PROF(1776285440): job sorting : wc=0.006s
11/09/2018 06:18:55.564|scheduler000|clusterm|P|PROF(1776285440): job dispatching: wc =0.015   

would become:

/path/messages-*NUMBER*:*DATE* *TIME*|worker001|clusterm|I|userx deleted job *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job profiling(low job) of *NUMBER* 
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job dispatching took *NUMBER* s (*NUMBER* fast)             
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: dispatched *NUMBER* job(s)             
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: parallel matching   *NUMBER*  *NUMBER* *NUMBER*                 
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: sequential matching  *NUMBER*  *NUMBER*   *NUMBER*               
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job sorting :wc =*NUMBER*s              
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job dispatching: wc=*NUMBER*    

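This would greatly reduce the number of lines and make the log much easier to read and analyze by eye.

Basically, I want to detect the words that vary and replace them with a placeholder symbol.

Answer 1

How far would you go?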

sed -r 's~([0-9]{2}/){2}[0-9]{4}~*DATE*~g; s/[0-9:.]{12}/*TIME*/g; s/[0-9.]+/*NUMBER*/g; s/[   ]*$//; ' file4 | uniq 
/path/messages-*NUMBER*:*DATE* *TIME*|worker*NUMBER*|clusterm|I|userx deleted job *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job profiling(low job) of *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job dispatching took *NUMBER* s (*NUMBER* fast)
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: dispatched *NUMBER* job(s)
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: parallel matching   *NUMBER*  *NUMBER* *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: sequential matching  *NUMBER*  *NUMBER*   *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job sorting :wc =*NUMBER*s
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job dispatching: wc=*NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job sorting : wc=*NUMBER*s
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job dispatching: wc =*NUMBER*

You get the idea? With some concentration, motivation, patience, and time, you could drop the pipe to uniq and arrive at a pure sed solution.
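
As a rough sketch of going one step further, the same substitutions can be wrapped in a function and followed by a count of each pattern. This is not the pure sed solution mentioned above. The function names normalize_log and summarize_log are made up for illustration, and the time and number patterns are slightly stricter than in the command above (an assumption about the log format: times always look like HH:MM:SS.mmm), so that a stray "." on its own is not turned into *NUMBER*.

```shell
# Hypothetical helper: mask dates, times and numbers, strip trailing blanks.
# Reads the files given as arguments, or standard input if there are none.
normalize_log() {
  sed -E '
    s~([0-9]{2}/){2}[0-9]{4}~*DATE*~g
    s/[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}/*TIME*/g
    s/[0-9]+(\.[0-9]+)*/*NUMBER*/g
    s/[[:space:]]+$//
  ' "$@"
}

# Hypothetical helper: count how often each pattern occurs, most frequent first.
# uniq only collapses adjacent duplicates, so the lines are sorted first.
summarize_log() {
  normalize_log "$@" | sort | uniq -c | sort -rn
}
```

Called as `summarize_log file4`, this prints each distinct pattern once, prefixed by its number of occurrences. Sorting loses the original order of the log, which is the trade-off for catching identical patterns that are not next to each other.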