awk, sed, grep, perl... which one should I use to print in this case?

Answer 1

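Unless you know the format of the HTML well, can control it, and errors and the like are not an issue, regular expressions are a poor fit. You can use them, but as mentioned above, it is not recommended.

I use them a lot myself, mostly for one-off extraction of simple data.

You can, for example, use Perl with HTML::TokeParser::Simple.

Very simplified: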

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
use HTML::Entities;
use utf8;

die "$0 [file | url]\n" unless defined $ARGV[0];

my $tp;
if ($ARGV[0] =~ /^http:\/\//) {
    $tp = HTML::TokeParser::Simple->new(url => $ARGV[0]);
} else {
    $tp = HTML::TokeParser::Simple->new(file => $ARGV[0]);
}

if (!$tp) {
    die "No HTML file found.\n";
}

# Array to store data.
my @val;
# Index
my $i = 0;

# The code is a bit mixed, with some redundancy.
# It could be made much simpler, or much safer,
# e.g. by checking for thead, tbody etc. and calling a sub to parse those.
# You could of course also print directly (without saving to an array),
# but you might want to use the data for something.
while (my $token = $tp->get_token) {
    next unless $token->is_start_tag('th') || $token->is_start_tag('td');
    # Treat a missing class attribute as empty to avoid "uninitialized" warnings.
    my $class = $token->get_attr('class') // '';
    next unless $class eq 'x' || $class eq 'R';
    # The token right after the start tag is the cell text.
    my $text = $tp->get_token->as_is;
    # Only data cells (td) have their HTML entities decoded.
    $val[$i++] = $token->is_start_tag('td') ? decode_entities($text) : $text;
}

my @width_col = (10, 8);

if ($i > 2 && !($i % 2)) {
    $i = 0;
    printf("%*s %*s\n",
        $width_col[0], "$val[$i++]",
        $width_col[1], "$val[$i++]"
    );
    while ($i < $#val) {
        printf("%*s %*d\n",
            $width_col[0], "$val[$i++]",
            $width_col[1], "$val[$i++]"
        );
    }
} else {
    die "ERR. Unable to extract data.\n"
}

Example output:

$ ./extract htmlsample 
   seconds     reqs
         0    10927
   <= 0.01  1026471
 0.01-0.02   535390
 0.02-0.05    93298

Answer 2

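As has already been mentioned, regular expressions are not well suited to parsing HTML. Similar to the other answer, you could craft a Ruby statement like the one below to perform this task. Note that it requires the nokogiri gem, which can be installed with sudo gem install nokogiri.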

ruby -rnokogiri -e 'h = Nokogiri::HTML(readlines.join); h.css("tr .x").zip(h.css("tr .R")).each { |d| puts "#{d[0].content} #{d[1].content}" }' sample.html

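This reads from sample.html and produces a two-dimensional array in which every item with the class="x" attribute inside a tr element is paired with every item with the class="R" attribute inside a tr element. It then prints one pair per line. For your example, the output is: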

seconds reqs
0 10927
<= 0.01 1026471
0.01-0.02 535390
0.02-0.05 93298
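
For readers more at home in the shell, the pairing that zip performs is loosely analogous to what paste does with two lists. The sketch below is only an illustration: the function name pair_columns and the three sample values per column are assumptions taken from the output above, and nothing here is part of the Ruby one-liner.

```shell
#!/bin/sh
# Illustration only: pair two lists line by line, the way zip pairs
# the class "x" items with the class "R" items.
# pair_columns and the sample values are hypothetical stand-ins.
pair_columns() {
    x_file=$(mktemp)
    r_file=$(mktemp)
    # First list: what the class "x" cells would contain.
    printf '%s\n' 'seconds' '0' '<= 0.01' > "$x_file"
    # Second list: what the class "R" cells would contain.
    printf '%s\n' 'reqs' '10927' '1026471' > "$r_file"
    # Join line N of the first list with line N of the second list.
    paste -d ' ' "$x_file" "$r_file"
    rm -f "$x_file" "$r_file"
}

pair_columns
```

As with zip, the pairing is purely positional, so it only lines up correctly when both lists have one entry per table row.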

Answer 3

You can use sed, and then cut, to get the fields you want. This is really a one-liner, but I have written it as a commented script file for clarity.

#!/bin/sed -f
s!</*thead!<tbody!g;      # to not get caught by 'th' below
s!<t[dh][^>]*>!%%%!g;     # replace start tag 'td' or 'th' with a delimiter
s!</t[dh]>!@@@!g;         # replace end tag 'td' or 'th' with a delimiter
s/<[^>]*>//g;             # delete any other tags
s/%%%\([^@]*\)@@@/\1 /g;  # get text between start and stop delimiters with a space
s/ $//                    # remove trailing space

Invoke it like this:

$ sed -f glean.sed test.html
seconds reqs %reqs Gbytes %bytes
0 10927  0.47% 0.01  0.18%
&lt;= 0.01 1026471 44.59% 0.11  1.81%
0.01-0.02 535390 23.26% 0.06  0.95%
0.02-0.05 93298  4.05% 0.27  4.29%

You can then get the first two fields with whatever you like (cut, as I suggested).
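
The whole pipeline, including the cut step, might look like the sketch below. The sample markup is hypothetical (the real page from the question is not reproduced here, and the class names x and R are borrowed from the other answers), and first_two_fields is just an illustrative name. The grep step, which drops the empty lines left behind by tag-only rows, is an addition of this sketch.

```shell
#!/bin/sh
# Hypothetical sample; the real HTML from the question may differ.
sample_html='<table>
<thead><tr><th class="x">seconds</th><th class="R">reqs</th><th class="R">%reqs</th></tr></thead>
<tbody>
<tr><td class="x">0</td><td class="R">10927</td><td class="R">0.47%</td></tr>
<tr><td class="x">0.01-0.02</td><td class="R">535390</td><td class="R">23.26%</td></tr>
</tbody></table>'

# Read HTML on stdin, print the first two cells of each table row.
first_two_fields() {
    # Same substitutions as the sed script above, passed as -e expressions.
    sed -e 's!</*thead!<tbody!g' \
        -e 's!<t[dh][^>]*>!%%%!g' \
        -e 's!</t[dh]>!@@@!g' \
        -e 's/<[^>]*>//g' \
        -e 's/%%%\([^@]*\)@@@/\1 /g' \
        -e 's/ $//' |
    # Drop lines that held only tags (assumption of this sketch).
    grep -v '^$' |
    # Keep the first two space-separated fields.
    cut -d ' ' -f 1,2
}

printf '%s\n' "$sample_html" | first_two_fields
```

Note that a space delimiter miscounts rows whose first cell itself contains a space, such as the &lt;= 0.01 row in the output above: there, fields 1 and 2 would be &lt;= and 0.01, so such rows need a different delimiter or separate handling.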
