awk: parse and write to another file

Question 1

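I'm going to assume that what you posted is only a sample, because it isn't valid XML. If that assumption is wrong, my answer doesn't hold... but in that case you need to go back to whoever supplied the XML, with a copy of the XML specification in hand, and ask them to 'fix it'.

But really, awk and regular expressions are not the right tool for this job. An XML parser is. With a parser, what you want to do is very simple: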

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

#parse your file - this will error if it's invalid. 
my $twig = XML::Twig -> new -> parsefile ( 'your_xml' );
#set output format. Optional. 
$twig -> set_pretty_print('indented_a');

#iterate all the 'record' nodes off the root. 
foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   #if - beneath this record - we have a node anywhere (that's what // means)
   #with a tag of 'keyword' and content of 'SEARCH' 
   #print the whole record. 
   if ( $record -> get_xpath ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> print;
   }
}
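The comment in the script says that parsing will error on invalid input. If you would rather test for that than let the script die, a minimal sketch along these lines may help. It assumes XML::Twig's safe_parse method, which wraps the parse in an eval; is_well_formed is a hypothetical helper name, and the two sample strings are made up for illustration.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

# Returns 1 if the string parses as well-formed XML, 0 otherwise.
# is_well_formed is a hypothetical helper name, not part of XML::Twig.
sub is_well_formed {
    my ($xml) = @_;
    # safe_parse returns the twig on success and a false value on failure,
    # leaving the parser's error message in $@.
    return XML::Twig->new->safe_parse($xml) ? 1 : 0;
}

my $good = '<XML><record><keyword>SEARCH</keyword></record></XML>';
my $bad  = '<XML><record><keyword>SEARCH</record></XML>';    # <keyword> is never closed

for my $doc ( $good, $bad ) {
    print is_well_formed($doc) ? "well-formed\n" : "not well-formed\n";
}
```

For a file rather than a string, safe_parsefile behaves the same way.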

XPath is a bit like a regular expression in some ways, but it is closer to a directory path. That means it is context-aware and can cope with XML structure.

In the above, ./ means "beneath the current node", so:

$twig -> get_xpath ( './record' )

refers to the "top-level" <record> tags.

But .// means "at any level beneath the current node", so it searches recursively:

$twig -> get_xpath ( './/search' )

would get every <search> node at any level.

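Square brackets denote a condition. This can be a function (e.g. text() to get the text of the node) or an attribute. For example, //category[@name] finds every category that has a name attribute, and //category[@name="xyz"] filters those further to the ones whose name is "xyz".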
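To make the path and condition syntax concrete, here is a small sketch that runs a few expressions against an inline document. The inline XML is a cut-down, made-up variant of the test data, and it filters on the category attribute of <record> rather than on a <category> element. XML::Twig implements only a subset of XPath, so treat this as an illustration of the forms used in this answer rather than of XPath in general.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

# Hypothetical sample data: three records, one without a category attribute.
my $xml = <<'END_XML';
<XML>
  <record category="xyz"><keywords><keyword>SEARCH</keyword></keywords></record>
  <record category="abc"><keywords><keyword>DONTSEARCH</keyword></keywords></record>
  <record><keywords><keyword>SEARCH</keyword></keywords></record>
</XML>
END_XML

my $twig = XML::Twig->new;
$twig->parse($xml);

# './record': <record> elements directly beneath the root.
my @all = $twig->get_xpath('./record');

# '[@category]': only records that carry a category attribute at all.
my @with_category = $twig->get_xpath('./record[@category]');

# '[@category="xyz"]': narrow further to one attribute value.
my @xyz = $twig->get_xpath('./record[@category="xyz"]');

# './/keyword[string()="SEARCH"]': keyword elements at any depth with that exact text.
my @hits = $twig->get_xpath('.//keyword[string()="SEARCH"]');

printf "%d records, %d with a category, %d in xyz, %d SEARCH keywords\n",
    scalar @all, scalar @with_category, scalar @xyz, scalar @hits;
```

The string() test compares the element's whole text, so a <keyword> containing "DONTSEARCH" does not match "SEARCH".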

Test XML:

<XML>
<record category="xyz">
<person ssn="" e-i="E">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>SEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
<record category="abc">
<person ssn="" e-i="F">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>DONTSEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is not present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
</XML>

Output:

 <record category="xyz">
    <person
        e-i="E"
        ssn="">
      <title xsi:nil="true" />
      <position xsi:nil="true" />
      <details>
        <names>
          <first_name/>
          <last_name></last_name>
        </names>
        <aliases>
          <alias>CDP</alias>
        </aliases>
        <keywords>
          <keyword xsi:nil="true" />
          <keyword>SEARCH</keyword>
        </keywords>
        <external_sources>
          <uri>http://www.google.com</uri>
          <detail>SEARCH is present in abc for xyz reason</detail>
        </external_sources>
      </details>
    </person>
  </record>

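Note: the above just prints the matching records to STDOUT. Actually... I don't think that's a very good idea. In particular, it doesn't print the surrounding XML structure, so if more than one record matches, the output (with no "root" node) isn't actually valid XML.

So instead, I would do exactly what you asked for: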

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

my $twig = XML::Twig -> new -> parsefile ('your_file.xml'); 
$twig -> set_pretty_print('indented_a');

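#walk the top-level records (findnodes is an alias for get_xpath). 
foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   #no 'keyword' with content 'SEARCH' beneath this record: drop it from the tree. 
   if ( not $record -> findnodes ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> delete;
   }
}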

open ( my $output, '>', "output.txt" ) or die $!;
print {$output} $twig -> sprint;
close ( $output );

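This inverts the logic instead: it deletes the records you don't want from the parsed data structure in memory, then prints the whole new structure (including any XML header) to a new file called "output.txt".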

Answer