How to split an entire text consisting only of letters into bigrams
For example:
odd even
od -> de
dd -> ev
This is what I have so far, but it does not produce the expected result:
[some source] | tail -n +30 | sed 's/[^A-Za-z\n]//g' | sed 's/\([A-Za-z]\)\([A-Za-z]\)\([A-Za-z]\)\([A-Za-z]\)\([A-Za-z]\)\{\0,1\}/\1\2 -> \3\4, \2\3 -> \4\5, /g' | sed 's/,/,\n/g'
Answer 1
Try the following sed script:
Content of infile:
odd even
one test of bigrams
Content of script.sed:
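## [[:blank:]] matches two characters: space and tab.
## The instruction deletes them from the line.
s/[[:blank:]]*//g
## The substitution above always succeeds, so branch once to reset the
## 't' flag; otherwise a line shorter than four characters is printed as is.
tb
## Label 'b'.
:b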
## Copy line to 'hold space'.
h
## Get first bigram.
s/\(..\)\(..\).*/\1 -> \2/
## If the last substitution succeeded, continue at label 'a'.
ta
## Otherwise the substitution failed: fewer than four characters are
## left, so no further bigram pair can be extracted. Read the next line.
b
## Label 'a'
:a
## Print.
p
## Copy 'hold space' into 'pattern space'.
g
## Delete first character.
s/^.//
## Goto label 'b' to repeat loop.
tb
Run the script:
sed -nf script.sed infile
Result:
od -> de
dd -> ev
de -> ve
ev -> en
on -> et
ne -> te
et -> es
te -> st
es -> to
st -> of
to -> fb
of -> bi
fb -> ig
bi -> gr
ig -> ra
gr -> am
ra -> ms
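As a cross-check of the loop in script.sed, here is a minimal sketch of the same sliding window written as a shell function around awk. The name bigrams_sliding is made up for illustration and is not part of the script above. It assumes, like the sed version, that each line is processed on its own and that only spaces and tabs have to be removed.

```shell
# Hypothetical helper (illustration only): print every overlapping
# bigram pair of each input line, one pair per output line.
bigrams_sliding() {
    awk '{
        gsub(/[ \t]/, "")    # drop spaces and tabs, as script.sed does
        # Slide a four-character window over the line, one position at a time.
        for (i = 1; i + 3 <= length($0); i++)
            print substr($0, i, 2) " -> " substr($0, i + 2, 2)
    }' "$@"
}

# Usage: read from standard input, or pass file names as arguments.
printf 'odd even\n' | bigrams_sliding
```

Running bigrams_sliding infile should give the same lines as the result listed above; lines with fewer than four letters produce no output.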
Answer 2
This might work for you:
echo -e "od\ndd\nde\nve" |
sed '1{x;s/^/oddevenodd/;x};G;/^\(..\)\n.*\1\(..\).*/s//\1 -> \2/'
od -> de
dd -> ev
de -> ve
ve -> no
Or is this what you mean?
echo -e "odd even\nthis and that" |
sed 's/ //g;s/^\(..\)\(.*\)/\1\2\1/;h;:a;s/^\(..\)\(..\).*/\1 -> \2/p;g;/^..../{s/^..//;h;ba};d'
od -> de
de -> ve
ve -> no
th -> is
is -> an
an -> dt
dt -> ha
ha -> tt
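The last one-liner differs from a sliding window in two ways: it appends the first two letters of each line to its end (so the text wraps around), and it advances two letters at a time instead of one. A minimal sketch of that reading as a shell function around awk follows. The name bigrams_step2 is hypothetical, and the sketch assumes only spaces need to be removed, as in the one-liner.

```shell
# Hypothetical helper (illustration only): non-overlapping bigram pairs
# with wrap-around, one line of input at a time.
bigrams_step2() {
    awk '{
        gsub(/ /, "")              # the one-liner removes spaces only
        s = $0 substr($0, 1, 2)    # wrap around: append the first bigram
        # Advance two letters per step, pairing each bigram with the next.
        for (i = 1; i + 3 <= length(s); i += 2)
            print substr(s, i, 2) " -> " substr(s, i + 2, 2)
    }' "$@"
}

printf 'odd even\nthis and that\n' | bigrams_step2
```

If overlapping pairs are wanted instead, changing the step from 2 to 1 turns this back into a sliding window over the wrapped line.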