How to convert HTML to a table using the shell

Question 1

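The following should more or less do the trick. Mind you, I originally wrote it without testing.

Edit: I have now tested it and fixed a few bugs, so it seems to work properly.
I am ignoring edge cases (multiple <h1>, <tbody> inside table fields, etc.).

Put it in "scriptname.pl", change the file names in the 2nd and 3rd lines, and run it with: perl scriptname.pl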

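#!/usr/bin/perl
open my $ifh, '<', "inputfilename.html" or die "Cannot read input: $!";
open my $ofh, '>', "outputfilename.html" or die "Cannot write output: $!";
# Process the input one line at a time; each tag is expected on its own line.
while(<$ifh>) {
  if(/<h1>(.*?)<\/h1>/) {
    # The <h1> text becomes the caption; the column headers are hard-coded.
    my $header = <<"END";
  <table>
    <caption>$1</caption>
    <thead>
        <tr>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
        </tr>
    </thead>
    <tbody>
END
    print $ofh $header;
  } elsif(/<div class="row">/) {
    # A row div opens a table row ...
    print $ofh "<tr>\n";
  } elsif(/<\/div>/) {
    # ... and any closing div ends it.
    print $ofh "</tr>\n";
  } elsif(/<p class=".*?">(.*?)<\/p>/) {
    # Each paragraph inside a row becomes one cell.
    print $ofh "<td>$1</td>\n";
  } elsif(/<\/body>/) {
    # Close the table just before the end of the body.
    print $ofh "</tbody>\n</table>\n</body>\n";
  } else {
    # Pass every other line through unchanged.
    print $ofh $_;
  }
}
close $ofh;
close $ifh;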
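
The script matches one tag per line, so the input has to be laid out accordingly. The original input file is not shown in this thread; the sketch below is a hypothetical sample (the function name `sample_input`, the class name `cell`, and the texts are assumptions) shaped to fit the regular expressions used above.

```shell
#!/bin/bash
# Hypothetical sample input: one tag per line, rows as <div class="row">,
# cells as <p class="...">. The names and texts here are made up for illustration.
sample_input() {
    cat <<'EOF'
<html>
<body>
<h1>Page Title</h1>
<div class="row">
<p class="cell">Text 1</p>
<p class="cell">Text 2</p>
<p class="cell">Text 3</p>
</div>
<div class="row">
<p class="cell">Text 1</p>
<p class="cell">Text 2</p>
<p class="cell">Text 3</p>
</div>
</body>
</html>
EOF
}

sample_input
```

Redirecting the output, for example with `sample_input > inputfilename.html`, gives a file the Perl script can process. Note that a wrapper `<div>` without `class="row"` would still have its closing `</div>` turned into a stray `</tr>`.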

Question 2

You are trying to extract the cells one by one, which makes it harder to rebuild the table.

It is simpler using only bash and pup, as follows:

#!/bin/bash

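# Number of lines containing an opening <div, used as the row count
count=$(grep '<div ' demo.html | wc -l)
# Text of the page's <h1>, used as the table caption
page_title=$(cat demo.html | pup 'body h1 text{}')

# Print one <tr> per div.row, with one <td> per cell text
tbody() {
    for ((i=1;i<count+1;++i)); do
        # Join the non-blank text lines of row $i with commas; IFS=, makes
        # the unquoted $row below split on commas only
        IFS=, row=$(cat demo.html | pup "body div.row:nth-of-type($i) text{}" | grep '\S' | paste -s -d, -)
        printf "\t\t<tr>\n"
        # printf repeats its format once per field of $row
        printf '\t\t\t<td>%s</td>\n' $row
        printf "\t\t</tr>\n"
    done
}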

cat <<EOF
<table>
    <caption>$page_title</caption>
    <thead>
        <tr>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
        </tr>
    </thead>
    <tbody>
$(tbody)
    </tbody>
</table>
EOF

Output

<table>
    <caption>Page Title</caption>
    <thead>
        <tr>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Text 1</td>
            <td>Text 2</td>
            <td>Text 3</td>
            <td>Text 4</td>
            <td>Text 5</td>
            <td>Text 6</td>
        </tr>
        <tr>
            <td>Text 1</td>
            <td>Text 2</td>
            <td>Text 3</td>
            <td>Text 4</td>
            <td>Text 5</td>
            <td>Text 6</td>
        </tr>
        <tr>
            <td>Text 1</td>
            <td>Text 2</td>
            <td>Text 3</td>
            <td>Text 4</td>
            <td>Text 5</td>
            <td>Text 6</td>
        </tr>
    </tbody>
</table>

Explanation

The idea is to extract the data row by row, looping until the last row. This snippet gives the number of rows:

grep '<div ' demo.html | wc -l
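
This count includes every line with an opening `<div `, not only the rows. A minimal sketch, assuming rows carry `class="row"` as the pup selector suggests and using a made-up input with a wrapper `<div>` (the `sample` function and its content are hypothetical), of how a narrower pattern changes the count:

```shell
#!/bin/bash
# Hypothetical input: a wrapper <div> around two row <div>s.
sample() {
    cat <<'EOF'
<div id="wrapper">
<div class="row">
</div>
<div class="row">
</div>
</div>
EOF
}

# grep -c prints the number of matching lines, like grep ... | wc -l
all_divs=$(sample | grep -c '<div ')
# Narrower pattern: only lines that open a row div
row_divs=$(sample | grep -c '<div class="row"')
echo "all divs: $all_divs, rows: $row_divs"
```

Here the broad pattern counts 3 lines and the narrow one counts 2, so with a wrapper present the loop would run one extra time.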

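Then, using it in the nth-of-type(n) selector, you can fetch a whole row instead of a single column. The result has to be piped through grep '\S' to remove the empty lines. Piping that to paste -s -d, - then produces a comma-separated result.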

IFS=, row=$(cat demo.html | pup "body div.row:nth-of-type($i) text{}" | grep '\S' | paste -s -d, -)
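
The filtering and joining steps can be tried without pup. In this sketch the function `row_text` is a hypothetical stand-in for pup's output, assumed to be the cell texts separated by blank or whitespace-only lines:

```shell
#!/bin/bash
# Stand-in for the text pup prints for one row (assumed shape):
# cell texts separated by empty and whitespace-only lines.
row_text() {
    printf 'Text 1\n\n   \nText 2\n\nText 3\n'
}

# grep '\S' keeps only lines that contain a non-whitespace character
# (\S is a GNU grep extension); paste -s -d, - joins them with commas.
row=$(row_text | grep '\S' | paste -s -d, -)
echo "$row"
```

This prints `Text 1,Text 2,Text 3`: the blank lines are gone and the remaining lines are joined into a single comma-separated string.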

Because IFS is a comma, printf '\t\t\t<td>%s</td>\n' $row expands to printf '\t\t\t<td>%s</td>\n' 'Text 1' 'Text 2' ..., and each argument is wrapped in <td>...</td>.
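
Two shell behaviors combine here: an unquoted expansion is split on the characters in IFS, and printf reuses its format string for each remaining argument. A small sketch, with a hypothetical helper `wrap_cells` that keeps the IFS change local to the function:

```shell
#!/bin/bash
# Hypothetical row, already joined with commas as in the script above.
row='Text 1,Text 2,Text 3'

wrap_cells() {
    # Limit the IFS change to this function; with IFS=, the unquoted $1
    # is split on commas only, so "Text 1" stays a single argument.
    local IFS=,
    # printf applies the format once per argument.
    printf '<td>%s</td>\n' $1
}

wrap_cells "$row"
```

This prints one `<td>...</td>` line per cell. Using `local IFS` is a design choice for the sketch; the script above gets the same isolation because `tbody` runs inside a command substitution.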

The \t escapes only indent the output; remove them entirely if you want the result printed without indentation.

Answer