Split one large PDF file into n PDF files based on content and rename each split file (in Bash).

Question 1

Is a quick Python script an option? The PyPDF2 package should let you do exactly what you want.

Question 2

I got it working. At least it works. But now I would like to optimize the process: handling 1000 items in one large PDF can take up to 40 minutes.

#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search criterion,
# naming each output file after the matched text.

# read the selected files (one path per line) into an array;
# the IFS assignment only applies to this read, so no unset is needed
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

# process files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
 # MY SEARCH CRITERIA was a 10 digit long ID number that begins with number 8:
 #storedid=$(pdftotext -f 1 -l 1 "$file" - | grep -E '8?[0-9]{9}')
 # the active criterion is the statement heading line followed by an 8 digit number
 storedid=$(pdftotext -f 1 -l 1 "$file" - | grep -E -m 1 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
 pattern=''
 pagetitle=''
 datestamp=''

 # loop over the real pages only; the last group is flushed explicitly below
 for (( pageindex=1; pageindex <= pagecount; pageindex+=1 )); do

  header=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 1)
  pageid=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | grep -E -m 1 'RESUMEN DE CUENTA Nº ?[0-9]{8}')

  echo "$pageid"
  datestamp=$(date +%s%N) # to avoid overwriting with same new name

  # match ID found on the page to the stored ID
  if [[ $pageid == "$storedid" ]]; then
   pattern+="$pageindex " # adds number as text to variable separated by spaces
   pagetitle+="$header+"
  else
   # process previous set of pages to output
   # ($pattern is left unquoted on purpose so each page number is a separate argument)
#  pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   pdftk "$file" cat $pattern output "$storedid.pdf"
   storedid=$pageid
   pattern="$pageindex "
   pagetitle="$header+"
  fi

  # process last output of the file
  if (( pageindex == pagecount )); then
#  pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   pdftk "$file" cat $pattern output "$storedid.pdf"
  fi
 done
done
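
One probable cost in the script above is that pdftotext is started twice for every page. A possible way to cut that down, sketched here rather than measured, is to extract the text of the whole file once and split it into pages afterwards. This assumes pdftotext ends each page with a form feed character, which is its default behaviour unless -nopgbrk is given. The helper name group_pages and its output format are made up for this illustration; they are not part of poppler or pdftk.

```shell
#!/bin/bash
# group_pages (hypothetical helper): reads the text of a whole PDF on stdin,
# with every page terminated by a form feed, and groups consecutive pages
# whose first line matching the regex $1 is identical.
# Prints one line per group: "<first>-<last><TAB><matched line>".
group_pages() {
  local re=$1 page line id current='' first=1 n=0
  # each read returns the text of one page (everything up to the next form feed)
  while IFS= read -r -d $'\f' page; do
    n=$((n + 1))
    id=''
    # take the first line of the page that matches the regex, if any
    while IFS= read -r line; do
      if [[ $line =~ $re ]]; then
        id=$line
        break
      fi
    done <<< "$page"
    if (( n == 1 )); then
      current=$id
    elif [[ $id != "$current" ]]; then
      # the ID changed: emit the group that just ended and start a new one
      printf '%d-%d\t%s\n' "$first" "$((n - 1))" "$current"
      first=$n
      current=$id
    fi
  done
  # emit the last group, if there was at least one page
  if (( n > 0 )); then
    printf '%d-%d\t%s\n' "$first" "$n" "$current"
  fi
}

# Usage sketch (not run here): one pdftotext call per file, one pdftk call per group.
# pdftotext "$file" - | group_pages 'RESUMEN DE CUENTA Nº ?[0-9]{8}' |
#  while IFS=$'\t' read -r range id; do
#   pdftk "$file" cat "$range" output "$id.pdf"
#  done
```

With this shape the number of pdftotext runs drops from two per page to one per file. pdftk is still started once per output file, so that part of the run time would remain.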
