Filtering XML documents that match a specific ID

Answer 1

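Summary

I wrote a Python solution, a Bash solution, and an Awk solution. The idea is the same for all of the scripts: go through the input line by line and use flag variables to keep track of state (i.e. whether or not we are currently inside an XML subdocument, and whether or not we have found a matching line).

In the Python script I read all of the lines into a list and keep track of the list index where the current XML subdocument begins, so that I can print the current subdocument when we reach its closing tag. I check each line against the regex pattern and use a flag to keep track of whether or not to output the current subdocument once it has been processed.

In the Bash script I use a temporary file as a buffer to store the current XML subdocument, and wait until it has been completely written before using grep to check whether it contains a line matching the given regex.

The Awk script is similar to the Bash script, but it uses an Awk array for the buffer instead of a file.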
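The shared idea can be sketched in a few lines. This is only an illustration under assumptions, not one of the three scripts: the function name matching_documents and its invert parameter are made up for this example, and it buffers the current subdocument in a list (as the Awk script does with an array) rather than reading the whole file first.

```python
import re

# An opening tag is a line that contains nothing but <name>
OPEN_TAG = re.compile(r'^<(\w+)>$')


def matching_documents(lines, pattern, invert=False):
    """Yield each XML subdocument (as a list of lines) that contains a match.

    Hypothetical helper for illustration only.
    """
    regex = re.compile(pattern)
    buffer = []          # lines of the current subdocument
    closing_tag = None   # None means we are between subdocuments
    found = False        # has any line of the current subdocument matched?

    for line in lines:
        stripped = line.strip()

        # Between subdocuments: only an opening tag is interesting
        if closing_tag is None:
            opening = OPEN_TAG.match(stripped)
            if opening:
                closing_tag = '</' + opening.group(1) + '>'
                buffer = [line]
                found = False
            continue

        # Inside a subdocument: buffer the line, then update the state
        buffer.append(line)
        if stripped == closing_tag:
            if found != invert:
                yield buffer
            closing_tag = None
        elif regex.search(stripped):
            found = True
```

Because it is a generator, it can be fed an open file object directly, so only one subdocument needs to be held in memory at a time.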

Test Data File

I checked the scripts against the following data file (data.xml), which is based on the sample data provided in your question.

<a>
  <b>
    string to search for: stuff
  </b>
</a>
in between xml documents there may be plain text log messages
<x>
    unicode string: øæå
</x>

Python Solution

Here is a simple Python script that does what you want.

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""

import sys
import re

invert_match = False

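# Check for the optional invert flag; remove it from the argument list
# so that the pattern is always sys.argv[1]
if len(sys.argv) > 1 and sys.argv[1] in ('-v', '--invert-match'):
    invert_match = True
    sys.argv.pop(1)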

regex = sys.argv[1]

# Open the XML-ish file
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

    # Read all of the data into a list
    lines = xmlfile.readlines()

    # Use flags to keep track of which XML subdocument we're in
    # and whether or not we've found a match in that document
    start_index = closing_tag = regex_match = False

    # Iterate through all the lines
    for index, line in enumerate(lines):

        # Remove trailing and leading white-space
        line = line.strip()

        # If we have a start_index then we're inside an XML document
        if start_index is not False:

            # If this line is a closing tag then reset the flags
            # and print the document if we found a match
            if line == closing_tag:
                if regex_match != invert_match:
                    print(''.join(lines[start_index:index+1]))
                start_index = closing_tag = regex_match = False

            # If this line is NOT a closing tag then we
            # search the current line for a match
            elif re.search(regex, line):
                regex_match = True

        # If we do NOT have a start_index then we're either at the
        # beginning of a new XML subdocument or we're in between
        # XML subdocuments
        else:

            # Check for an opening tag for a new XML subdocument
            match = re.match(r'^<(\w+)>$', line)
            if match:

                # Store the current line number
                start_index = index

                # Construct the matching closing tag
                closing_tag = '</' + match.groups()[0] + '>'

Here is how you would run the script to search for the string "stuff":

python xmlgrep.py stuff data.xml

Here is the output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

Here is how you would run the script to search for the string "øæå":

python xmlgrep.py øæå data.xml

Here is the output:

<x>
    unicode string: øæå
</x>

You can also specify -v or --invert-match to search for the documents that do not match, and the script can read from standard input:

cat data.xml | python xmlgrep.py -v stuff
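The option handling could also be delegated to the standard argparse module instead of inspecting sys.argv by hand. The following is a minimal sketch under that assumption; the function name parse_args and the help texts are invented for this example and are not part of the script above.

```python
import argparse


def parse_args(argv=None):
    """Parse the xmlgrep.py command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog='xmlgrep.py')
    parser.add_argument('-v', '--invert-match', action='store_true',
                        help='print the subdocuments that do NOT match')
    parser.add_argument('regex',
                        help='regular expression to search for')
    parser.add_argument('file', nargs='?', default=None,
                        help='input file; standard input is used if omitted')
    return parser.parse_args(argv)
```

The resulting object exposes invert_match, regex, and file attributes, and a missing pattern produces a usage message rather than an IndexError.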

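Bash Solution

This is a Bash implementation of the same basic algorithm. It uses a flag to keep track of whether the current line belongs to an XML document, and a temporary file as a buffer to store each XML document while it is being processed.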

#!/usr/bin/env bash
# xmlgrep.sh

# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"

# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""

# Use a temporary file to store the current XML subdocument
# and remove it again when the script exits
TEMPFILE="$(mktemp)"
trap 'rm -f "${TEMPFILE}"' EXIT

# Clear the internal field separator to preserve leading white-space
export IFS=''

# Iterate through all the lines of the file
# (-r keeps backslashes in the input from being treated as escapes)
while read -r LINE; do

    # If we're already in an XML subdocument then update
    # the temporary file and check to see if we've reached
    # the end of the document
    if "${XML_DOC}"; then

        # Append the line to the temp-file
        echo "${LINE}" >> "${TEMPFILE}"

        # If this line is a closing tag then reset the flags
        if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
            XML_DOC=false
            CLOSING_TAG=""

            # Print the document if it contains the match pattern 
            if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                cat "${TEMPFILE}"
            fi
        fi

    # Otherwise we check to see if we've reached
    # the beginning of a new XML subdocument
    elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

        # Extract the tag-name
        TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

        # Construct the corresponding closing tag
        CLOSING_TAG="</${TAG_NAME}>"

        # Set the XML_DOC flag so we know we're inside an XML subdocument
        XML_DOC=true

        # Start storing the subdocument in the temporary file 
        echo "${LINE}" > "${TEMPFILE}"
    fi
done < "${FILENAME}"

Here is how you would run the script to search for the string "stuff":

bash xmlgrep.sh data.xml 'stuff'

Here is the corresponding output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

Here is how you would run the script to search for the string "øæå":

bash xmlgrep.sh data.xml 'øæå'

Here is the corresponding output:

<x>
    unicode string: øæå
</x>

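Awk Solution

Here is an awk solution. My awk isn't great, so it's a bit rough. It uses the same basic idea as the Bash and Python scripts: it stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints the document if it contains a line matching the given regular expression. Here is the script: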

#!/usr/bin/env gawk
# xmlgrep.awk

# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#

# Initialize Variables
BEGIN{
    XML_DOC=0;
    CLOSING_TAG="";
    BUFFER_LENGTH=0;
    MATCH=0;
}
{
    if (XML_DOC==1) {

        # If we're inside an XML block, add the current line to the buffer
        BUFFER[BUFFER_LENGTH]=$0;
        BUFFER_LENGTH++;

        # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
        if ($0 ~ CLOSING_TAG) {
            XML_DOC=0;
            CLOSING_TAG="";

            # If there was a match then output the XML document.
            # Iterate by index: "for (i in BUFFER)" has no guaranteed order.
            if (MATCH==1) {
                for (i = 0; i < BUFFER_LENGTH; i++) {
                    print BUFFER[i];
                }
            }
        }
        # If we found a matching line then update the MATCH flag
        else {
            if ($0 ~ PATTERN) {
                MATCH=1;
            }
        }
    }
    else {

        # If we reach a new opening tag then start storing the data in the buffer
        if ($0 ~ /<[a-z]+>/) {

            # Set the XML_DOC flag
            XML_DOC=1;

            # Reset the buffer
            delete BUFFER;
            BUFFER[0]=$0;
            BUFFER_LENGTH=1;

            # Reset the match flag
            MATCH=0;

            # Compute the corresponding closing tag
            match($0, /<([a-z]+)>/, match_groups);
            CLOSING_TAG="</" match_groups[1] ">";
        }
    }
}

You can invoke it like this:

gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

Here is the corresponding output:

<x>
    unicode string: øæå
</x>
