Joining lines encoded as quoted-printable

Answer 1

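What you describe at the end is parsing the mbox into the emails' contents (I'm counting attachments as content) and then looking at that content. That's the right thing to do!

So do that. Actually parsing the mbox file into messages will make your life a bit easier. Here's something off the top of my head; it's untested, but my editor doesn't flag anything in red.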

#!/usr/bin/env python3
import mailbox
import re as regex

pattern = regex.compile("BRIANSREGEX")  # placeholder: put your own regex here

mb = mailbox.mbox("Briansmails.mbox", create=False)
for msg in mb:
    print(f"{msg['Subject']}")
    for part in msg.walk():
        print(f"| {part.get_content_type()}, {part.get_content_charset()}")
        print("-"*120)
        # decode=True undoes the transfer encoding (quoted-printable, base64)
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset()
        if type(payload) is bytes:
            content = ""
            try:
                content = payload.decode(charset or "ascii")
            except (LookupError, UnicodeDecodeError):
                # unknown or wrong charset: fall back to a lossy ASCII decode
                content = payload.decode("ascii", errors="replace")
            match = pattern.search(content)
            # do what you want with that match…
            if match:
                print(f"| matched at {match.start()}")
        print("-"*120)
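
To see why get_payload(decode=True) takes care of the line joining, here is a small sketch. The message bytes and the imgur URL in it are made up for illustration, and it uses email.message_from_bytes on a single message instead of opening a real mbox. The point is only that a soft line break ("=" at the end of a line) cuts the URL in the raw text, and disappears once the quoted-printable transfer encoding is undone.

```python
import email
import re

# Hypothetical message: the URL is split by a quoted-printable soft line break.
RAW = (
    b"Subject: demo\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Look at https://i.imgur.com/abc=\n"
    b"def.png please\n"
)

URL = re.compile(rb"https?://[^/]*imgur\.com/[a-zA-Z0-9/.]*")

msg = email.message_from_bytes(RAW)

# Searching the raw bytes stops at the "=" of the soft line break.
raw_hits = URL.findall(RAW)

# Searching the decoded payload sees the rejoined line.
decoded_hits = URL.findall(msg.get_payload(decode=True))

print(raw_hits)
print(decoded_hits)
```

The raw search only finds a truncated URL, while the decoded search finds the whole thing, which is the reason for parsing first and grepping second.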

Answer 2

I upvoted Marcus Müller's answer, but here is a version based on his.

#!/usr/bin/env python3
import mailbox
import re
import sys

# the same URL pattern twice: once for raw bytes, once for decoded text
byte_pattern = re.compile(rb"https?://[^/]*imgur\.com/[a-zA-Z0-9/.]*")
str_pattern = re.compile(r"https?://[^/]*imgur\.com/[a-zA-Z0-9/.]*")

mb = mailbox.mbox(sys.argv[1], create=False)
for msg in mb:
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if type(payload) is bytes:
            # first, search it as a binary string
            for match in byte_pattern.findall(payload):
                print(match.decode('ascii', errors='replace'))
            # then, try to decode it in case it's utf-16 or something weird
            charset = part.get_content_charset()
            if charset and charset != 'utf-8':
                try:
                    content = payload.decode(charset)
                    for match in str_pattern.findall(content):
                        print(match)
                except (LookupError, UnicodeDecodeError):
                    # unknown charset name or undecodable bytes: skip this part
                    pass
        else:
            print('failed to get message part as bytes')

msg.walk() performs a depth-first traversal that includes non-leaf nodes, so a given part may be a multipart container rather than a leaf node of the tree.
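
If you want to see that for yourself, here is a quick sketch. The message is built in memory with email.message.EmailMessage purely for illustration (subject and bodies are made up); a message read from an mbox behaves the same way when walked.

```python
from email.message import EmailMessage

# Hypothetical two-part message: plain text plus an HTML alternative.
msg = EmailMessage()
msg["Subject"] = "walk demo"
msg.set_content("plain text body")
msg.add_alternative("<p>html body</p>", subtype="html")

# walk() yields the multipart container first, then its leaves.
kinds = [(part.get_content_type(), part.is_multipart()) for part in msg.walk()]
for content_type, is_container in kinds:
    print(content_type, "container" if is_container else "leaf")

# For the container there is no single payload to decode.
container_payload = msg.get_payload(decode=True)
print(container_payload)
```

The container comes back with no decodable payload, which is why the script skips anything where is_multipart() is true.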

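Only when the part is a leaf node does the script first search for the pattern in the raw byte string and then try to decode the payload as text using the declared character set (if that is not UTF-8). (If it is UTF-8, the most common case, it has already been searched as a byte string.)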
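
Why bother with the second pass? A rough illustration, assuming a part whose body is UTF-16 (the sample text and URL are made up): the byte pattern cannot match because every ASCII character is interleaved with a zero byte, while the string pattern matches once the payload is decoded with its charset.

```python
import re

byte_pattern = re.compile(rb"https?://[^/]*imgur\.com/[a-zA-Z0-9/.]*")
str_pattern = re.compile(r"https?://[^/]*imgur\.com/[a-zA-Z0-9/.]*")

# Hypothetical payload of a text part declared as charset=utf-16.
text = "see https://i.imgur.com/abcdef.png"
payload = text.encode("utf-16")

# Pass 1: raw bytes. Finds nothing, "h\x00t\x00t\x00p..." is not "http".
byte_hits = byte_pattern.findall(payload)

# Pass 2: decode with the declared charset, then search as text.
str_hits = str_pattern.findall(payload.decode("utf-16"))

print(byte_hits)
print(str_hits)
```

For UTF-8 (and plain ASCII) bodies the URL characters are stored as the same bytes, so the first pass already catches them.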
