Removing glitches from text after conversion with PDF OCR

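I use an OCR PDF reader to convert PDF files. The original text was an image inside the PDF file, and Foxit PDF converted it to text using OCR. The problem after conversion is that the text is not laid out correctly: words and lines appear to have been shifted. Sample text: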

  biochemistry can be divided in three fields; molecular genetics, protein science and metabolism. Over the last decades 
of the 20th century, biochem
istry has through these three disciplines becom
e successful at explaining living processes. Almost all areas o
f the life sciences are being uncovered and developed by biochemical methodology and research.[2] Biochemistry focuses on unde
rstanding how biolog
ical molecules give 
rise to the processes that occur within living cells and
 between cells,[3] which
 in turn relates greatly to the study and understanding of 
, organs, and organism structure and function[4]

Biochemistry is closely related to mol
ecular biology, the study of the molecular mechanisms by which geneti
c information encoded in DNA is able to result in the processes of life.[5]

Much of biochemistry deals with the structu
res, 
 an
d interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.[6] The chemistry of the cell also depends on the 
 of smaller molecules and ions. Th
ese can be inorganic, for example water and metal ions, or organic, for example the amino acids, which are used to synthesi
ze proteins.[7]
 The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, b
iochemists investigate the causes and cures of diseases.[8] In nutrition, they study how to maintain health wellness and study the effects of nutritional deficiencies.[9] In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.

The problem is that some words are cut in half across lines. Is there anything I can do to fix the text so that it is readable?

Answer 1

There is probably room for improvement, but this is a start:

perl -0777 -ne 's/([^ ])\n/$1/g; s/\n/ /g; print' < input | fmt

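This uses Perl to join the broken lines. The -0777 switch makes Perl read the whole input as a single string. If a line ends with a space, the line is simply continued (its line break becomes a space); otherwise the line break is removed entirely. The output is then piped through fmt to split the resulting long line back into lines of normal width.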
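For readers who want to see the joining rule spelled out step by step, here is a minimal Python sketch of the same idea. It is an illustration, not part of the original answer: the function name join_ocr_lines and the default width of 75 are assumptions, and textwrap.fill stands in for fmt, so the exact wrapping may differ from what fmt produces.

```python
import re
import textwrap


def join_ocr_lines(text: str, width: int = 75) -> str:
    # A line break that directly follows a non-space character is dropped,
    # so a word split across lines ("biochem" / "istry") is glued together.
    text = re.sub(r"([^ ])\n", r"\1", text)
    # Any line break still left followed a space; turn it into a space.
    text = text.replace("\n", " ")
    # Squeeze runs of whitespace into single spaces before wrapping.
    text = " ".join(text.split())
    # Re-wrap the single long line, playing the role of fmt.
    return textwrap.fill(text, width=width)


sample = (
    "of the 20th century, biochem\n"
    "istry has through these three disciplines becom\n"
    "e successful at explaining living processes.\n"
)
print(join_ocr_lines(sample))
```

As with the one-liner, blank lines between paragraphs are not preserved: everything is merged into one block of text before it is wrapped again.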

Answer 2

You can remove the extra line breaks and carriage returns with an awk one-liner like this:

awk '{gsub(/\n/,""); gsub(/\r/,""); print}' RS='' file

biochemistry can be divided in three fields; molecular genetics, protein science and metabolism. Over the last decades of the 20th century, biochemistry has through these three disciplines become successful at explaining living processes. Almost all areas of the life sciences are being uncovered and developed by biochemical methodology and research.[2] Biochemistry focuses on understanding how biological molecules give rise to the processes that occur within living cells and between cells,[3] which in turn relates greatly to the study and understanding of , organs, and organism structure and function[4]
Biochemistry is closely related to molecular biology, the study of the molecular mechanisms by which genetic information encoded in DNA is able to result in the processes of life.[5]
Much of biochemistry deals with the structures,  and interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.[6] The chemistry of the cell also depends on the  of smaller molecules and ions. These can be inorganic, for example water and metal ions, or organic, for example the amino acids, which are used to synthesize proteins.[7] The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, biochemists investigate the causes and cures of diseases.[8] In nutrition, they study how to maintain health wellness and study the effects of nutritional deficiencies.[9] In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.

The gsub function has the following form:

gsub(regexp, replacement [, target])

This is similar to the sub function, except that gsub replaces all of the longest, leftmost, non-overlapping matching substrings it can find. The "g" in gsub stands for "global", which means the replacement is made everywhere.
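As a rough analogy (not awk itself), the difference between sub and gsub resembles the count argument of Python's re.sub. The sample string below is made up for illustration. Python's regex engine picks matches by leftmost-first rather than leftmost-longest rules, but for a single-character pattern such as a newline the two behave the same.

```python
import re

record = "biochem\nistry has becom\ne successful"

# Like awk's sub(): only the first (leftmost) match is replaced.
first_only = re.sub(r"\n", "", record, count=1)

# Like awk's gsub(): every non-overlapping match is replaced.
everywhere = re.sub(r"\n", "", record)

print(repr(first_only))
print(repr(everywhere))
```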

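gsub(/\n/,"") replaces every newline within the record with nothing, that is, deletes it. Because RS='' puts awk in paragraph mode, each record is a whole block of text delimited by blank lines, so the newlines inside a paragraph are part of the record.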

gsub(/\r/,"") likewise deletes every carriage return (ASCII code 13) within the record.
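To make the paragraph-mode behaviour concrete, here is a small Python sketch that approximates what the awk command does. The function name join_within_paragraphs is hypothetical, and the splitting rule is a simplification of awk's RS='' handling. It relies on the OCR output keeping its spaces at the ends or starts of lines, as in the sample, since line breaks are deleted rather than replaced with spaces.

```python
import re


def join_within_paragraphs(text: str) -> str:
    # Paragraph mode ignores leading and trailing newlines in the input.
    body = text.strip("\n")
    if not body:
        return ""
    # With RS='' each run of blank lines separates two records.
    records = re.split(r"\n{2,}", body)
    # The two gsub calls delete every newline and carriage return in a record.
    cleaned = [r.replace("\n", "").replace("\r", "") for r in records]
    # print writes each record followed by a single newline.
    return "".join(line + "\n" for line in cleaned)


sample = "Biochemistry is closely related to mol\necular biology\n\nMuch of biochemistry deals with the structu\nres\n"
print(join_within_paragraphs(sample), end="")
```

Unlike the Perl approach, this keeps one output line per paragraph, which matches the three-paragraph result shown above.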
