Estimating a file's compression ratio

Question 1

Here is a (hopefully equivalent) Python version of Stéphane Chazelas's solution:

python -c "
import zlib
from itertools import islice
from functools import partial
import sys
with open(sys.argv[1], 'rb') as f:
  compressor = zlib.compressobj()
  t, z = 0, 0.0
  for chunk in islice(iter(partial(f.read, 4096), b''), 0, None, 10):
    t += len(chunk)
    z += len(compressor.compress(chunk))
  z += len(compressor.flush())
  print(z/t)
" file

Question 2

You could, for instance, compress one block in every 10 to get an idea:

perl -MIPC::Open2 -nE 'BEGIN{$/=\4096;open2(\*I,\*O,"gzip|wc -c")}
                       if ($. % 10 == 1) {print O $_; $l+=length}
                       END{close O; $c = <I>; say $c/$l}'

(here with 4K blocks; the data is read from standard input).

Question 3

I had a multi-GB file and was not sure whether it was already compressed, so I test-compressed its first 10 million bytes:

head -c 10000000 large_file.bin | gzip | wc -c

It is not perfect, but it worked well for me.
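The shell pipeline above prints only the compressed size, so the ratio still has to be worked out by hand. A minimal Python sketch of the same idea follows; the function name `head_compression_ratio` and its parameters are made up for illustration, and level 6 is chosen to match the default of the `gzip` command.

```python
import gzip


def head_compression_ratio(path, nbytes=10_000_000, level=6):
    """Return compressed size / original size for the first nbytes of path.

    Returns None for an empty file, where no ratio can be computed.
    """
    with open(path, "rb") as f:
        head = f.read(nbytes)
    if not head:
        return None
    # Compress the sample in memory and compare sizes.
    return len(gzip.compress(head, compresslevel=level)) / len(head)
```

A result close to 1.0 (or slightly above it) suggests the data is already compressed; a much smaller value suggests it would compress well. As with the shell version, only the start of the file is examined, so a file whose content changes character later on can be misjudged.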

Question 4

This is an improved Python version based on iruvar's great solution. The main improvement is that the script reads from disk only the data blocks that it actually compresses.

import zlib
def Predict_file_compression_ratio(MyFilePath):
 blocksize = (4096 * 1) # Increase if you want to read more bytes per block at once.
 blocksize_seek = 0

 # r = read, b = binary
 with open(MyFilePath, "rb") as f:
  # Make a zlib compressor object, and set compression level.
  # 1 is fastest, 9 is slowest
  compressor = zlib.compressobj(1)
  t, z, counter = 0, 0, 0

  while True:
    # Use this modulo calculation to check every "number" of blocks.
    if counter % 10 == 0:
      # Seek to the correct byte position of the file.
      f.seek(blocksize_seek)
      # The block above will be read, increase the seek distance by one block for the next iteration.
      blocksize_seek += blocksize
      # Read data chunk of file into this variable.
      data = f.read(blocksize)
      
      # Stop if there are no more data.
      if not data:
        # For zlib: Flush any remaining compressed data. Not doing this can lead to a tiny inaccuracy.
        z += len(compressor.flush())
        break

      # Uncompressed data size, add size to variable to get a total value.
      t += len(data)
      # Compressed data size
      z += len(compressor.compress(data))

    # When we skip, we want to increase the seek distance. This is vital for correct skipping.
    else:
      blocksize_seek += blocksize
    # Increase the block / iteration counter.
    counter += 1

 # Print the results. But avoid division by 0 >_>
 if not t == 0:
  print('Compression ratio: ' + str(z/t))
 else:
  print('Compression ratio: none, file has no content.')
 print('Compressed: ' + str(z))
 print('Uncompressed: ' + str(t))

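If a high data rate matters more than an exact compression ratio, you can use lz4 instead. This is useful when you want to find the most compressible files with low CPU usage. The module has to be installed with pip. In the Python code itself, this is all that is needed in place of the zlib compression call: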

import lz4.block
z += len(lz4.block.compress(data))
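To make swapping compressors a one-argument change, the sampling loop can take the compression function as a parameter. The sketch below is an assumption-laden variant, not the script above: `sample_ratio` and its parameters are hypothetical names, and each sampled block is compressed on its own (as `lz4.block.compress` does), so with zlib the result is usually a little more pessimistic than a streaming `compressobj`.

```python
import zlib


def sample_ratio(path, compress=zlib.compress, blocksize=4096, step=10):
    """Estimate compressed/original size by compressing every step-th block.

    compress is any callable that maps bytes to bytes.
    Returns None for an empty file.
    """
    t = z = 0      # total sampled bytes, total compressed bytes
    offset = 0     # byte position of the next block to sample
    with open(path, "rb") as f:
        while True:
            f.seek(offset)              # skip unsampled blocks without reading them
            data = f.read(blocksize)
            if not data:                # reading at or past end of file gives b""
                break
            t += len(data)
            z += len(compress(data))
            offset += blocksize * step
    return z / t if t else None
```

With the lz4 package installed, the call would then be `sample_ratio(path, compress=lz4.block.compress)`; with no extra argument it falls back to zlib.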

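I have observed that this script can trash the standby memory (definitely on Windows), which degrades file performance. This is especially noticeable on systems with conventional hard drives and when the function is run on a large number of files at once. Setting a low memory page priority on the script's Python process avoids this memory trashing. I chose to do that with AutoHotkey on Windows.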
