Removing duplicate pages from a PDF

Answer 1

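comparepdf is a command-line tool for comparing PDFs. Its exit code is 0 if the files are identical and non-zero otherwise. It can compare by text content or by visual appearance (useful for scans, for example).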

comparepdf 1.pdf 2.pdf
comparepdf -ca 1.pdf 2.pdf #compare appearance instead of text

So what you can do is explode the PDF into single pages, compare them pairwise, and delete accordingly.

#!/bin/bash
# explode the PDF into single-page files
pdftk original.pdf burst
# compare 900 pages pairwise
for (( i=1 ; i<=899 ; i++ )) ; do
  # pdftk's naming is pg_0001.pdf, pg_0002.pdf etc.
  pdf1=pg_$(printf %04d "$i").pdf
  pdf2=pg_$(printf %04d "$((i+1))").pdf
  # Remove the first file on a match. Every adjacent pair is checked, so a run
  # of three or more identical pages is reduced to its last page.
  if comparepdf "$pdf1" "$pdf2" ; then
     rm "$pdf1"
  fi
done
# reunite in sorted order:
pdftk $(find . -name 'pg_*.pdf' | sort ) cat output new.pdf
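
The loop above hard-codes the page count (899 comparisons for 900 pages). As a rough sketch, the same adjacent-pair logic can be driven by whichever pg_*.pdf files actually exist. The function name dedupe_adjacent and the COMPARE variable are made up for illustration and are not part of pdftk or comparepdf. COMPARE defaults to comparepdf, and any command that exits 0 for matching files can stand in for it.

```shell
# Hypothetical helper: remove the earlier file of every identical adjacent
# pair of pg_NNNN.pdf files in directory $1 (default: current directory).
# COMPARE is an assumption: a command that exits 0 when two files match.
dedupe_adjacent() {
  local dir prev cur
  dir=${1:-.}
  prev=""
  # The glob expands in sorted order, which matches the zero-padded names.
  for cur in "$dir"/pg_*.pdf; do
    [ -e "$cur" ] || continue          # nothing matched: the glob stays literal
    # COMPARE is left unquoted on purpose so "cmd -flag" splits into words.
    if [ -n "$prev" ] && ${COMPARE:-comparepdf} "$prev" "$cur" >/dev/null 2>&1; then
      rm -- "$prev"                    # the last page of a run survives
    fi
    prev=$cur
  done
}
```

After `pdftk original.pdf burst`, calling `dedupe_adjacent .` would replace the for loop. Note that a non-zero exit status can also mean the comparison failed for some other reason, in which case nothing is deleted.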

Edit: Following @notautogenerated's comment, you can select pages from the original file instead of merging the single-page PDFs. Once the pairwise comparison is done, you can run:

pdftk original.pdf cat $(find -name 'pg_*.pdf' |
                        awk -F '[._]' '{printf "%d\n",$3}' |
                        sort -n ) output new.pdf
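
The awk step above turns the surviving file names into page numbers for pdftk's cat operation. A pure-shell equivalent might look like the sketch below. The function name remaining_pages is hypothetical. Leading zeros are stripped explicitly so that a name such as pg_0010.pdf yields 10 and is never read as an octal number.

```shell
# Hypothetical helper: print the page numbers of the pg_NNNN.pdf files left in
# directory $1 (default: current directory), space-separated, ascending.
remaining_pages() {
  local dir f n out
  dir=${1:-.}
  out=""
  for f in "$dir"/pg_*.pdf; do
    [ -e "$f" ] || continue     # nothing matched: the glob stays literal
    n=${f##*/pg_}               # drop the directory and the "pg_" prefix
    n=${n%.pdf}                 # drop the extension, leaving e.g. 0010
    # Strip leading zeros one at a time.
    while [ "${n#0}" != "$n" ]; do n=${n#0}; done
    out="$out${out:+ }$n"
  done
  printf '%s\n' "$out"
}
```

With this in place, the final step could be written as `pdftk original.pdf cat $(remaining_pages .) output new.pdf`.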

Answer 2

Here is a modified version of @FelixJN's code that fixes a typo in the printf format string (the format must be %04d). I have tested the code and it works.

#!/bin/bash
pdftk original.pdf burst  # explode the PDF
# the resulting files are named pg_0001.pdf, pg_0002.pdf etc.

for (( i=1 ; i<=1140 ; i++ )) ; do  # loop over all the single-page PDF files
  pdf1=pg_$(printf %04d "$i").pdf
  pdf2=pg_$(printf %04d "$((i+1))").pdf
  echo "$pdf1" "$pdf2"
  if comparepdf "$pdf1" "$pdf2" ; then
     rm "$pdf1"  # remove the first file if two adjacent files are duplicates
  fi
done
# merge the remaining files in sorted order:
pdftk $(find . -name 'pg_*.pdf' | sort ) cat output new.pdf

Answer 3

If you don't have access to the comparepdf tool, here is a solution that worked for me (building on FelixJN's answer).

#explode pdf
pdftk original.pdf burst

# delete consecutive pages that have the same size
# (only pg_*.pdf is matched, so original.pdf is never a candidate)
last=-1
find . -type f -name 'pg_*.pdf' -printf '%f\0' | sort -z |
    while IFS= read -r -d '' i; do
        s=$(stat -c '%s' "$i")
        [[ $s = $last ]] && rm "$i"
        last=$s
    done

#rearrange the pdf
pdftk original.pdf cat $(find -name 'pg_*.pdf' |
                        awk -F '[._]' '{printf "%d\n",$3}' |
                        sort -n ) output new.pdf

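This may delete pages that should not be deleted, but I think the probability is low. The approach for removing files of the same size comes from: How do I delete files with the same size in a directory?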
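
Two different pages can happen to have the same size, which is where the false deletions come from. One way to lower that risk is to confirm a size match with a byte-for-byte comparison before deleting. The sketch below assumes that truly duplicate pages come out of pdftk burst as byte-identical files, which is not guaranteed, so it may keep some duplicates that the size-only check would remove. The function name remove_same_bytes is hypothetical.

```shell
# Hypothetical helper: delete a page only when it has the same size AND the
# same bytes as the most recent page that was kept.
# wc -c is used instead of stat -c so it also works without GNU stat.
remove_same_bytes() {
  local dir f s last prev
  dir=${1:-.}
  last=-1
  prev=""
  for f in "$dir"/pg_*.pdf; do
    [ -e "$f" ] || continue     # nothing matched: the glob stays literal
    s=$(wc -c < "$f")
    # The size test is a cheap pre-filter; cmp -s makes the final decision.
    if [ "$s" = "$last" ] && cmp -s "$prev" "$f"; then
      rm -- "$f"                # the later page of the pair is removed
    else
      last=$s
      prev=$f
    fi
  done
}
```

Running `remove_same_bytes .` after the burst step would take the place of the size-only loop. The page-selection command that rebuilds new.pdf stays the same.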
