Extracting information from a text file

Question 1

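This looks like an XML or similar markup language file. Such files should not be parsed with simple regular expressions, lest you awaken Tony the Pony. You should use a parser that is specific to those tags and available in your favorite scripting language.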

This looks like OMIM or HPO data, in which case you should be able to get plain text files instead and simplify the task. If you can't, and you really do need to parse this file, you can do it in Perl:

perl -lne '/<.*?>([^<>]+)/ && print $1' foo.txt

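However, this will break if there can be more than one tag per line, if a tag's contents can span multiple lines, or if a tag's data can contain > or <. If all of your information is always between <category="whatever">blah blah</category>, you can get everything more robustly (allowing for multi-line tag contents and embedded < or >):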

#!/usr/bin/env perl
use strict;
use warnings;

## Set the start and end tags
my $end   = "</category>";
my $start = "<category=.*?>";

my @lines;       ## content saved for the current tag
my $inside = 0;  ## 1 while we are between $start and $end

## Read through the file line by line
while (<>) {
    ## Set $inside to 1 if the current line matches $start
    $inside = 1 if /$start/;
    ## Capture any relevant content on this line. I am also removing a
    ## leading $start tag if present; the greedy (.+) keeps any $end tag,
    ## which is removed further down.
    if (s/($start)*(.+)($end)*/$2/) {
        push @lines, $2 if $inside;
    }
    ## If the current line matches $end, strip the tag from what we have
    ## saved so far, print it, set $inside back to 0 and empty the
    ## @lines array
    if (/$end/) {
        s/$end// for @lines;
        print "@lines\n";
        @lines  = ();
        $inside = 0;
    }
}

Save this script as foo.pl, make it executable, and run it on your file:

./foo.pl file.txt

For example:

$ cat file.txt 
<category="SpecificDisease">Type II 
 human complement C2 deficiency</category>
<category="Modifier">Huntington disease</category>
<category="CompositeMention">hereditary breast < and ovarian cancer</category>
<category="DiseaseClass">myopathy > cardiopathy</category>

$ ./foo.pl file.txt 
Type II   human complement C2 deficiency
Huntington disease
hereditary breast < and ovarian cancer
myopathy > cardiopathy

But again, if your file is more complex than the example above, this will fail and you will need a more sophisticated approach.
