text.txt

I have a large text file that I created by combining many HTML files:

cat *.html > all_html_files.txt

Inside that text file there are specific strings that I want to extract into another text file. For example:

book title>The Edge of the Round World< font 32 - extra

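I want to extract all of the text that appears between the > and < symbols.

In this example that is "The Edge of the Round World", and I want to extract every other string in the document that appears between the same symbols.

I tried to find a solution, but I could not apply the commands I found because I could not tell exactly which parts I needed to replace. I could not quite grasp the logic.

Thanks to this forum, I am getting reacquainted with sed and awk.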

Answer 1

sed -ne's/<\([^>"]*\("[^"]*"\)*\)*>//g;/./p' <infile >outfile

...or, with GNU or BSD sed:

sed -Ene's/<([^>"]*("[^"]*")*)*>//g;/./p' <infile >outfile
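
A quick way to check what these commands do is to feed them a few lines of made-up HTML. The sketch below is only an illustration: the sample input and the variable names sample and stripped are assumptions, not part of the original answer. It uses the -E form shown above, so it assumes GNU or BSD sed.

```shell
# Hypothetical sample input: a tag with a quoted ">" inside an attribute,
# a line that contains only a tag, and a plain tagged line.
sample='<p class="a>b">Hello</p>
<br/>
<b>World</b>'

# Delete every <...> tag (quoted attribute values may contain ">"),
# then print only the lines that still have at least one character.
stripped=$(printf '%s\n' "$sample" |
  sed -Ene 's/<([^>"]*("[^"]*")*)*>//g;/./p')

printf '%s\n' "$stripped"
```

With this input, only the text Hello and World should be left, each on its own line; the line that held nothing but a tag is dropped by the /./p part.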

As a proof of concept, here is something more complicated:


url='http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags'
curl -s  "$url"   |
sed -Ene:n -etD   \
        -e's/ ?[^ "]*"[^"]*"//g;/"/'bN  \
        -e's/[[:space:]]*($|<)/\n\1/'   \
        -e'/^Moderator.s Note/q'        \
        -e'/.\n/P;/\n</!t'        -e:D  \
        -e'/\n/D;/^<script>/!s/>/&\n/'  \
        -e'/\n/!s/<\/script>/\n/' -e:N  \
        -e'/\n/!{N;s///;}' -e//tD -etn

The hardest part is filtering out all of the JavaScript.


html - RegEx match open tags except XHTML self-contained tags - Stack Overflow
current community
chat
        Stack Overflow
        Meta Stack Overflow
                        Stack Overflow Careers
your communities
Sign up
 or
log in
 to customize your list.
more stack exchange communities
company blog
Stack Exchange
Inbox
Reputation and Badges
sign up
log in
tour
        help
                            Tour
                                Start here for a quick overview of the site
                        Help Center
                            Detailed answers to any questions you might have
                            Meta
                                Discuss the workings and policies of this site
                    Stack Overflow
Questions
Jobs
beta
Tags
Users
Badges
Ask Question
Sign up
&times;
            Stack Overflow is a community of 4.7 million programmers, just like you,
 helping each other. Join them; it only takes a minute:
RegEx match open tags except XHTML self-contained tags
up vote
1326
down vote
favorite
4475
I need to match all of these opening tags:
&lt;p&gt;
&lt;a&gt;
But not these:
&lt;br /&gt;
&lt;hr /&gt;
I came up with this and wanted to make sure I've got it right. I am only capturing t
he
a-z
.
&lt;([a-z]+) *[^/]*?&gt;
I believe it says:
Find a less-than, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, greedy, except
/
, then
Find a greater-than
Do I have that right? And more importantly, what do you think?
html
regex
xhtml
share
edited
May 26 '12 at 20:37
            community wiki
        11 revs, 7 users 58%
Jeff
locked
 by
Robert Harvey
&#9830;
Jun 7 '12 at 19:41
This post has been locked due to the high amount of off-topic comments generated. Fo
r extended discussions, please use
chat
.
comments disabled on deleted / locked posts / reviews
&nbsp;|&nbsp;
                                35 Answers
35
            active
            oldest
            votes
1
2
 next
up vote
4427
down vote
accepted
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is
not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-
regex questions here so many times before, the use of regex will not allow you to co
nsume HTML. Regular expressions are a tool that is insufficiently sophisticated to u
nderstand the constructs employed by HTML. HTML is not a regular language and hence
cannot be parsed by regular expressions. Regex queries are not equipped to break dow
n HTML into its meaningful parts. so many times but it is not getting to me. Even en
hanced irregular regular expressions as used by Perl are not up to the task of parsi
ng HTML. You will never make me crack. HTML is a language of sufficient complexity t
hat it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML usi
ng regular expressions. Every time you attempt to parse HTML with regular expression
s, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.
 Parsing HTML with regex summons tainted souls into the realm of the living. HTML an
d regex go together like love, marriage, and ritual infanticide. The &lt;center> can
not hold it is too late. The force of regex and HTML together in the same conceptual
 space will destroy your mind like so much watery putty. If you parse HTML with rege
x you are giving in to Them and their blasphemous ways which doom us all to inhuman
toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he
comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe,
 your psyche withering in the onslaught of horror. Rege̿̔̉x-based HTML parsers are t
he cancer that is killing StackOverflow
it is too late it is too late we cannot be saved
 the trangession of a chi͡ld ensures regex will consume all living tissue (except fo
r HTML which it cannot, as previously prophesied)
dear lord help us how can anyone survive this scourge
 using regex to parse HTML has doomed humanity to an eternity of dread torture and s
ecurity holes
using rege
x as a tool to process HTML establishes a brea
ch between this world
 and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but
more corrupt) a mere glimp
se of the world of reg​
ex parsers for HTML will ins
​tantly transport a p
rogrammer's consciousness i
nto a w
orl
d of ceaseless screaming, he comes
, the pestilent sl
ithy regex-infection wil​
l devour your HT
​ML parser, application and existence for all time like Visual Basic only worse
he comes he com
es
do not fi
​ght h
e com̡e̶s, ̕h̵i
​s un̨ho͞ly radiańcé de
stro҉ying all enli̍̈́̂̈́ghtenment, HTML tags
lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq
​uid p
ain, the song of re̸gular exp​re
ssion parsing
will exti
​nguish the voices of mor​
tal man from the sp
​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​
he f
inal snuf
fing o
f the lie​
s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T A
LL I​S L
OST th
e pon̷y he come
s he c̶̮om
es he co
me
s t
he
 ich​
or permeat
es al
l MY FAC
E MY FACE ᵒh god n
o NO NOO̼
O​O N
Θ stop t
he an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨ
e̠̅s
 ͎a̧͈͖r̽̾̈́͒͑e
 n
​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ T
O͇̹̺ͅƝ̴ȳ̳ TH̘
Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝
S̨̥̫͎̭ͯ̿̔̀ͅ
Have you tried using an XML parser instead?

Answer 2

I like using grep with Perl regular expressions for this kind of task. You might want to try this:

grep -oP '(?<=book title>).*(?=<)' all_html_files.txt
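
Note that grep -P is a GNU extension, and the .* in the pattern above is greedy, so on a line with several < characters the match runs to the last one. As a hedged alternative, the sketch below uses plain sed with [^<]* so the capture stops at the first <. It assumes at most one "book title>" per line, as in the question's example; the sample line and the variable names line and title are made up for illustration.

```shell
# Hypothetical input line with a second tag after the title.
line='book title>The Edge of the Round World< font 32 - extra <b>x</b>'

# Capture everything after "book title>" up to the first "<"
# and print only the lines where the substitution succeeded.
title=$(printf '%s\n' "$line" |
  sed -n 's/.*book title>\([^<]*\)<.*/\1/p')

printf '%s\n' "$title"
```

To run the same expression over the combined file, pass all_html_files.txt to sed as an argument instead of piping the sample line in.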

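Answer 3

Using regular expressions to extract information from HTML is not a good idea, especially when syntax elements can span multiple lines of the file.

If I only needed to do this once, I would open the file in my favorite text editor and use search-and-replace macros to cut the content down. In fact, I did exactly that today :) It did take a relatively long time, though.

If you need to do this regularly, use a tool designed for the job. See htmlparsing.com and Wikipedia's comparison of HTML parsers.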

Answer 4

I solved a simple scenario. Here is the sample text:

text.txt

book title>The Linux Command Line< font 32 - extra
book title>How Linux Works< font 32 - extra
book title>UNIX and Linux System Administration Handbook< font 32 - extra
book title>Raspberry Pi Cookbook< font 32 - extra
book title>Linux Bible< font 32 - extra
book title>The Linux Programming Interface< font 32 - extra

Command

$ cat text.txt | awk 'BEGIN {FS=">"} {print $2}' | awk 'BEGIN {FS="<"} {print $1}'

Output

The Linux Command Line
How Linux Works
UNIX and Linux System Administration Handbook
Raspberry Pi Cookbook
Linux Bible
The Linux Programming Interface
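
The two awk calls can probably be collapsed into one by treating both < and > as field separators. This is a sketch of that variant, not the original answer's command; the two sample lines are copied from text.txt above, and the variable names sample and titles are assumptions.

```shell
# Two of the sample lines from text.txt above.
sample='book title>The Linux Command Line< font 32 - extra
book title>How Linux Works< font 32 - extra'

# With "<" or ">" as the field separator, field 2 is the text
# between the first ">" and the "<" that follows it.
titles=$(printf '%s\n' "$sample" | awk -F'[<>]' '{print $2}')

printf '%s\n' "$titles"
```

On the real file the same program would be run as awk -F'[<>]' '{print $2}' text.txt. Like the original, it only looks at the first > on each line.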
