Removing characters from an expression using sed

2024-5-29

shell-script shell text-processing sed regular-expression

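I have a string of the form

|a Some text, letters or numbers. | Some other text letters or numbers |b some other part of text |c some other letters or numbers

A bar may appear on its own, as in "numbers. | Some other", or together with a letter, as in "|a", "|b", "|c", and so on, possibly all the way up to "|z".

But the string could also be

| Title without any other bars

In other words, the number of bars is not known in advance.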

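I need to find two regular expressions to use with sed.

The first should find all the text between |a and |b, or between |b and |c.

Using the first example string above, finding all the text after |a and before |b gives:

Some text, letters or numbers. | Some other text letters or numbers

In the same example, finding all the text after |b and before |c gives:

some other part of text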

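The second expression should find all the text after |a, but instead of stopping at |b it should simply remove every bar, whether it is a bare bar (|) or a bar followed by a letter (|a, |b, |c, and so on).

Using the first example string again:

Some text, letters or numbers. Some other text letters or numbers some other part of text some other letters or numbers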

Answer 1

Assuming GNU utilities and that the data is in a file named data,

grep -Po '(?<=\|a).*(?=\|b)' data

 Some text, letters or numbers. | Some other text letters or numbers
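If Python is closer to hand, the same lookbehind/lookahead idea can be sketched with the `re` module. This is only a rough equivalent and not part of the original answer: the function name `between` and its `start`/`stop` parameters are made up for illustration, and it assumes single-letter markers.

```python
import re

def between(line, start="a", stop="b"):
    """Return the text after `|start` and before the last `|stop`, or None."""
    # Same shape as the grep pattern: lookbehind for |a, greedy .*, lookahead for |b.
    pattern = r"(?<=\|%s).*(?=\|%s)" % (re.escape(start), re.escape(stop))
    match = re.search(pattern, line)
    return match.group(0) if match else None

line = ("|a Some text, letters or numbers. | Some other text letters or numbers "
        "|b some other part of text |c some other letters or numbers")
print(repr(between(line, "a", "b")))
print(repr(between(line, "b", "c")))
```

Like the grep command, this keeps the spaces next to the markers, and it returns None for a line that has no matching pair.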

sed -r -e 's/^.*\|a//' -e 's/\|[a-z]?//g' data

 Some text, letters or numbers.  Some other text letters or numbers  some other part of text  some other letters or numbers 
 Title without any other bars

Change |a and |b to other letters (|b, |c, |d, and so on) as needed.
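For comparison, here is a minimal Python sketch of what the two sed substitutions do to a single line. The name `strip_markers` is hypothetical, and the sketch assumes the input is processed one line at a time, as sed does.

```python
import re

def strip_markers(line):
    """Mimic the two sed substitutions on a single line."""
    # First expression: drop everything up to and including the last |a.
    line = re.sub(r"^.*\|a", "", line, count=1)
    # Second expression: drop every bar, plus one lowercase letter right after it.
    return re.sub(r"\|[a-z]?", "", line)

line = ("|a Some text, letters or numbers. | Some other text letters or numbers "
        "|b some other part of text |c some other letters or numbers")
print(repr(strip_markers(line)))
print(repr(strip_markers("| Title without any other bars")))
```

One thing this makes easy to check: a bare bar that is immediately followed by a lowercase letter loses that letter too, so "x|other" becomes "xther". With the sample data that never happens because every bare bar is followed by a space.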

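Neither of these removes the whitespace around the |x markers, so the text keeps its leading and trailing spaces (neither of which can be shown here). If you want to remove those as well, include them in the patterns: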

grep -Po '(?<=\|a ).*(?= \|b)' data
sed -r -e 's/^.*\|a ?//' -e 's/ ?\|([a-z] ?)?//g' data

As written here, this sed command joins the individual sections together. If you want a space between them, change the empty replacement // at the end to / /.
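The effect of the empty versus single-space replacement is easier to see side by side. The following is a hedged Python sketch of the whitespace-aware sed command; `strip_markers_trimmed` and its `joiner` parameter are illustrative names, not something from the commands above.

```python
import re

def strip_markers_trimmed(line, joiner=""):
    """Remove markers together with one optional space on each side."""
    # First expression: drop the prefix through |a and one following space.
    line = re.sub(r"^.*\|a ?", "", line, count=1)
    # Second expression: replace each marker and its adjacent spaces with
    # `joiner` ("" joins the sections, " " keeps them apart).
    return re.sub(r" ?\|(?:[a-z] ?)?", lambda _match: joiner, line)

line = ("|a Some text, letters or numbers. | Some other text letters or numbers "
        "|b some other part of text |c some other letters or numbers")
print(strip_markers_trimmed(line))        # sections run together
print(strip_markers_trimmed(line, " "))   # sections separated by a space
```

With the empty joiner the output contains "numberssome other part of textsome", which is the joining described above. With a space as the joiner those sections stay separate, though the bare bar still leaves a double space after "numbers.".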

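Answer 2

It is not clear whether you want the letters in the delimiters to be consecutive, so I will assume you want to handle the harder case, where the delimiters are required to be consecutive (i.e. |a pairs with |b but not with |c). I am not sure this can be done with regular expressions alone (at least not without a very verbose one). In any case, here is a simple Python script that handles this situation: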

#!/usr/bin/env python3
# -*- coding: ascii -*-
"""parse.py"""

import sys
import re

def extract(string):
    """Return the text between delimiters of the form `|START` and `|STOP`
    where START is a single ASCII letter and STOP is the next sequential
    ASCII letter (e.g. `|a` and `|b` if START=a and STOP=b or
    `|x` and `|y` if START=x and STOP=y). Return an empty string if
    either delimiter is missing."""

    # Find the opening delimiter (e.g. '|a' or '|b')
    start_match = re.search(r'\|[a-z]', string)
    if start_match is None:
        return ''
    start_index = start_match.start()
    start_letter = string[start_index+1]

    # Find the matching closing delimiter, searching only after the opener
    stop_letter = chr(ord(start_letter) + 1)
    stop_index = string.find('|' + stop_letter, start_index+2)
    if stop_index == -1:
        return ''

    # Extract and return the substring
    return string[start_index+2:stop_index]

def remove(string):
    """Drop everything up to and including the first `|LETTER` delimiter,
    then delete every remaining bar, with or without a trailing letter."""

    # Find the opening delimiter (e.g. '|a' or '|b')
    start_match = re.search(r'\|[a-z]', string)

    # Remove everything up to and including the opening delimiter, if any
    if start_match is not None:
        string = string[start_match.end():]

    # Remove the remaining bars, each with its optional letter
    string = re.sub(r'\|[a-z]?', '', string)

    # Return the updated string
    return string

if __name__=="__main__":
    input_string = sys.stdin.readline()
    sys.stdout.write(extract(input_string) + '\n')
    sys.stdout.write(remove(input_string))
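The script above only looks at the first opening delimiter on the line. If every consecutive pair is wanted (|a to |b, |b to |c, and so on), one possible extension is sketched here. The function name `sections` and the choice to return a dict keyed by the opening letter are assumptions made for illustration, and a letter that occurs twice simply overwrites its earlier entry.

```python
import re

def sections(line):
    """Map each letter to the text between `|letter` and `|next_letter`."""
    result = {}
    # Each match records where a lettered delimiter such as '|a' sits.
    for marker in re.finditer(r"\|([a-z])", line):
        letter = marker.group(1)
        closing = "|" + chr(ord(letter) + 1)
        stop = line.find(closing, marker.end())
        if stop != -1:  # skip openers that have no sequential closer
            result[letter] = line[marker.end():stop]
    return result

line = ("|a Some text, letters or numbers. | Some other text letters or numbers "
        "|b some other part of text |c some other letters or numbers")
for letter, text in sorted(sections(line).items()):
    print(letter, repr(text))
```

For the sample line this yields entries for "a" and "b" only, because |c has no closing |d after it.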
