How do I find the character positions of a string in a file?

Answer 1

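With current versions of Perl you can use the magic arrays @- and @+ to get the positions of the match for the whole regex and for any capture groups. The 0th element of both arrays holds the offsets for the whole matched substring, so $-[0] is the one you are interested in.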

As a one-liner:

$ echo 'aöæaæaæa' | perl -CSDLA -ne 'BEGIN { $pattern = shift }; printf "%d\n", $-[0] while $_ =~ m/$pattern/g;'  æa
2
4
6

Or as a full script:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode;
use open  ":encoding(utf8)";
undef $/;
my $pattern = decode_utf8(shift);
binmode STDIN, ":utf8";
while (<STDIN>) {
    printf "%d\n", $-[0] while $_ =~ m/$pattern/g;
}

For example:

$ echo 'aöæaæaæa' | perl match.pl æa -
2
4
6

(The latter script only works on standard input. I do not seem to be able to force Perl to treat all files as UTF-8.)

Answer 2

And with zsh:

set -o extendedglob # for (#m) which in patterns causes the matched portion to be
                    # made available in $MATCH and the offset (1-based) in $MBEGIN
                    # (and causes the expansion of the replacement in
                    # ${var//pattern/replacement} to be deferred to the
                    # time of replacement)

haystack=aöæaæaæa
needle=æ

offsets=() i=0
: ${haystack//(#m)$needle/$((offsets[++i] = MBEGIN - 1))}
print -l $offsets

Answer 3

Use GNU awk or another POSIX-compliant awk implementation (not mawk) with the correct locale settings.

$ LANG='en_US.UTF-8' gawk -v pat='æa' -- '
{
    s = $0;
    pos = 0;
    while (match(s, pat)) {
        pos += RSTART-1;
        print "file", FILENAME ": line", FNR, "position", pos, "matched", substr(s, RSTART, RLENGTH);
        pos += RLENGTH;
        s = substr(s, RSTART+RLENGTH);
    }
}
' <<<'aöæaæaæa'
file -: line 1 position 2 matched æa
file -: line 1 position 4 matched æa
file -: line 1 position 6 matched æa
$

The pattern given in the -v pat parameter can be any regular expression that is valid in gawk.
