Text processing (reading two files and computing) [awk, scripting]

Question 1

awk '
    NR==FNR{                                #process the matrix file first
        A[$1] = 1                           #set of known words
        for(i=2;i<=NF;i++)
            B[$1 OFS i] = $i                #array with indexes [word field_num]
        next
        }
    $1 in A{                                #if word in array A
        max = ""                            #no candidate key yet
        for(i in B)                         #consider only keys of this exact word
            if(index(i, $1 OFS) == 1 && (max == "" || B[max] < B[i]))
                max = i                     #find maximum in B-array
        if(max != ""){
            print max, B[max]               #output word + field_num + value
            delete B[max]                   #exclude value from next search
            }
        }
    ' matrix list

If your awk version supports arrays of arrays (the A[$1][i] syntax, a GNU awk extension), the script can be simplified.

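awk '
    NR==FNR{
        for(i=2;i<=NF;i++)
            A[$1][i] = $i                   #A[word][field_num] = value
        next
        }
    $1 in A{
        max = ""                            #no candidate index yet
        for(i in A[$1])
            if(max == "" || A[$1][max] < A[$1][i])
                max = i
        if(max != ""){
            print $1, max, A[$1][max]
            delete A[$1][max]               #exclude value from next search
            }
        }
    ' matrix list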
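
For awk versions without arrays of arrays, the same idea can be sketched with standard comma subscripts (joined by SUBSEP), which also avoids scanning the keys of every word on each lookup. This is only an illustrative variant, not the answer's own code: the array names n and v and the sample data are assumptions, and the file layout is assumed to be the same as above (matrix holds a word followed by its values, list holds one word per line in its first field).

```sh
# Hypothetical sample data; the file names matrix and list follow the scripts above.
tmp=$(mktemp -d)
printf '%s\n' 'bank 0.9 1.5 3.2 -0.2 0.1' 'God 1.0 2.1 -0.5 0.7' 'rose 0.2 -1.8' > "$tmp/matrix"
printf '%s\n' 'car transport' 'car machine' 'bank economy' 'bank politics' 'bank parks' 'God religion' > "$tmp/list"

out=$(awk '
    NR==FNR{                                #process the matrix file first
        n[$1] = NF                          #remember the vector length per word
        for(i=2;i<=NF;i++)
            v[$1, i] = $i                   #key is word SUBSEP field_num
        next
        }
    $1 in n{                                #list word that has a vector
        best = 0                            #0 means no candidate found yet
        for(i=2;i<=n[$1];i++)               #visit only the fields of this word
            if(($1, i) in v && (best == 0 || v[$1, i]+0 > v[$1, best]+0))
                best = i
        if(best){
            print $1, best, v[$1, best]     #output word + field_num + value
            delete v[$1, best]              #exclude value from next search
            }
        }
    ' "$tmp/matrix" "$tmp/list")
printf '%s\n' "$out"
rm -rf "$tmp"
# prints:
# bank 4 3.2
# bank 3 1.5
# bank 2 0.9
# God 3 2.1
```

Each line of list consumes the largest remaining value of its word, so the three bank lines yield the three largest bank values in descending order. Adding 0 to both operands forces a numeric comparison regardless of how the awk implementation stores field values in arrays.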

Question 2

This is actually quite complex. Unless someone comes up with a miraculous one-liner, I recommend writing an awk script.

The awk script, in a file:

NR==FNR {

    a[$1]++
    next

} # You probably know what this does since it's your starting point

# If first field is a key in array a
$1 in a {
    # Assign the number of occurrences of this word to variable n
    n=a[$1]
    # Initialize this upper bound to +INFINITY
    k=-log(0)

    # Loop on the number of occurrences of the word
    for (i=0; i<n; i++) {
        # No candidate found yet in this pass (0 is not a valid value index)
        i_m=0

        # Loop on the fields of the matrix line for that word
        for (j=2; j<NF+1; j++) {

            # Look for the largest value that stays below the previous max k
            if ($j < k && (i_m == 0 || $j > m)) { m=$j; i_m=j }

        }
        # Stop once the vector has no smaller value left
        if (i_m == 0) break
        # Print the word, the index of its max and its value
        print $1, i_m, m
        # Store the max to be able to scan for the next biggest number at next iteration
        k=m
    }

}

Run it:

$ awk -f myScript.awk list matrix

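My script seems to work fine, with one caveat: when a word appears in list at least as many times as there are values in its vector in matrix, it may not be able to return a distinct value for every occurrence. Since your vectors are very large, that should not be a problem here. Also, initializing k to -log(0) to get the value inf is a bit odd, but I don't know how to set it directly (=inf obviously does not work). It could probably be made to handle a few more cases (for example, when the same value appears several times in a vector...), but now that you have a starting point, I'll leave that to you.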
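
One possible way to cover the remaining cases mentioned above (repeated values in a vector, and a word listed more often than its vector has values) is to remember which field indexes were already printed instead of comparing against the previous maximum. The following is a minimal sketch under that assumption, not the script above: the names count, used and best, and the sample data, are made up for illustration. It needs no infinity value at all.

```sh
# Hypothetical sample data: bank has a repeated value and is listed 4 times.
tmp=$(mktemp -d)
printf '%s\n' 'bank a' 'bank b' 'bank c' 'bank d' 'God e' > "$tmp/list"
printf '%s\n' 'bank 3.2 1.5 3.2' 'God 1.0 2.1' > "$tmp/matrix"

out=$(awk '
    NR==FNR { count[$1]++; next }           # first file: occurrences per word

    $1 in count {
        split("", used)                     # portable way to empty an array
        # One pass per occurrence, but never more passes than there are values
        for (i = 0; i < count[$1] && i < NF - 1; i++) {
            best = 0                        # 0 means no candidate found yet
            for (j = 2; j <= NF; j++)
                if (!(j in used) && (best == 0 || $j + 0 > $best + 0))
                    best = j
            used[best] = 1                  # skip this field in later passes
            print $1, best, $best
        }
    }
    ' "$tmp/list" "$tmp/matrix")
printf '%s\n' "$out"
rm -rf "$tmp"
# prints:
# bank 2 3.2
# bank 4 3.2
# bank 3 1.5
# God 3 2.1
```

Because the comparison is strict, equal values are reported in field order, and the fourth occurrence of bank produces no output since only three values exist.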

Question 3

TXR Lisp and its awk macro:

(let ((h (hash :equal-based)))
  (awk (:inputs "word-dom-pairs")
    (t (inc [h [f 0] 0])))
  (awk (:inputs "word-vectors")
    (t (whenlet ((count [h [f 0]]))
         (fconv - : r)
         (let* ((n-fn-pairs (zip (rest f) (range 2)))
                (n-fn-sorted [sort n-fn-pairs > first]))
           (each ((p [n-fn-sorted 0..count]))
             (prn [f 0] (second p) (first p))))))))

Run:

$ txr munge.tl 
bank 4 3.2
bank 3 1.5
bank 2 0.9
God 3 2.1

Data:

$ cat word-dom-pairs 
car transport
car machine
bank economy
bank politics
bank parks
God religion

$ cat word-vectors 
bank 0.9 1.5 3.2 -0.2 0.1
God 1.0 2.1 -0.5 0.7
rose 0.2 -1.8

Here is a version of the program combined into a single awk expression:

(awk (:inputs "word-dom-pairs" "word-vectors")
     (:let (h (hash :equal-based)))
     ((= arg 1) (inc [h [f 0] 0]))
     ((= arg 2) (whenlet ((count [h [f 0]]))
                  (fconv - : r)
                  (let* ((n-fn-pairs (zip (rest f) (range 2)))
                         (n-fn-sorted [sort n-fn-pairs > first]))
                    (each ((p [n-fn-sorted 0..count]))
                      (prn [f 0] (second p) (first p)))))))

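The :inputs of the two previously separate awk-s are combined into one. The unconditionally true pattern t is replaced with selectors that choose the processing based on which input is being read, as given by the variable arg. The let that bound the hash table variable is folded into the awk macro's :let clause.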

If the (:inputs ...) clause is removed, the files can be supplied as a pair of command-line arguments:

$ txr munge.tl file1 file2

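TXR Lisp is a type-safe dynamic language in which variables must be defined before they are assigned or used. Nonexistent variables and junk strings are not the number zero, and strings that look like numbers are not those numbers. This is why we explicitly define the hash table and use fconv to explicitly convert the second and subsequent fields to real numbers (r).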
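
For contrast, the implicit conversions that standard awk performs, and that the explicit definitions above replace, can be seen in a short sketch. The variable names x and r are arbitrary, and the commands are only an illustration of awk's documented behavior, not part of the TXR program.

```sh
# An uninitialized variable equals both the number 0 and the empty string
a=$(awk 'BEGIN { if (x == 0 && x == "") print "both" }')
# A junk string silently counts as 0 in arithmetic
b=$(awk 'BEGIN { r = "abc" + 1; print r }')
# Input fields that look like numbers are compared numerically ...
c=$(echo '10 9' | awk '{ r = ($1 > $2); print r }')
# ... while string constants are compared as strings
d=$(awk 'BEGIN { r = ("10" > "9"); print r }')
printf '%s %s %s %s\n' "$a" "$b" "$c" "$d"
# prints: both 1 1 0
```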

Answer