Removing null bytes from the end of a large file

Answer 1

To create a backup copy of a disk while saving space, use gzip:

gzip </dev/sda >/path/to/sda.gz

To restore the disk from the backup, use:

gunzip -c /path/to/sda.gz >/dev/sda

This will likely save far more space than merely stripping trailing NUL bytes.
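As a rough illustration of that claim, the following Python sketch builds a small in-memory stand-in for a disk image and compares its gzip-compressed size with its size after only the trailing NULs are stripped. The sample data and the helper name compare_sizes are made up for this example; real disk contents will compress differently.

```python
import gzip


def compare_sizes(image: bytes) -> tuple[int, int]:
    """Return (size after stripping trailing NULs, size after gzip)."""
    stripped = image.rstrip(b"\0")
    compressed = gzip.compress(image)
    return len(stripped), len(compressed)


# Hypothetical image: text, a zero-filled gap, more text, then trailing zeros.
image = (
    b"some file contents\n" * 5000
    + bytes(1 << 20)
    + b"more file contents\n" * 5000
    + bytes(1 << 20)
)

stripped_size, gzip_size = compare_sizes(image)
print(f"original: {len(image)}  stripped: {stripped_size}  gzip: {gzip_size}")
```

Stripping removes only the final run of zeros, whereas gzip also shrinks the zero-filled gap in the middle and any other repetitive data.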

Removing trailing NUL bytes

If you really do want to remove trailing NUL bytes and you have GNU sed, you might try:

sed '$ s/\x00*$//' /dev/sda >/path/to/sda.stripped

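This may run into problems if the data on a large disk exceeds one of sed's internal limits. GNU sed has no built-in limit on data size, but the GNU sed manual explains that system memory limits can prevent it from processing large files:

GNU sed has no built-in limit on line length; as long as it can malloc() more (virtual) memory, you can feed or construct lines as long as you like.

However, recursion is used to handle subpatterns and indefinite repetition. This means that the available stack space may limit the size of the buffer that can be processed by certain patterns.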

Answer 2

You can write a simple tool to solve this problem.

It reads the file, finds the last valid (non-null) byte, and then truncates the file.

A Rust example, from https://github.com/zqb-all/cut-trailing-bytes:

use std::io;
use std::io::prelude::*;
use std::fs::File;
use std::fs::OpenOptions;
use std::path::PathBuf;
use structopt::StructOpt;
use std::num::ParseIntError;

fn parse_hex(s: &str) -> Result<u8, ParseIntError> {
    u8::from_str_radix(s, 16)
}

#[derive(Debug, StructOpt)]
#[structopt(name = "cut-trailing-bytes", about = "A tool for cut trailing bytes, default cut trailing NULL bytes(0x00 in hex)")]
struct Opt {
    /// File to cut
    #[structopt(parse(from_os_str))]
    file: PathBuf,

    /// For example, pass 'ff' if want to cut 0xff
    #[structopt(short = "c", long = "cut-byte", default_value="0", parse(try_from_str = parse_hex))]
    byte_in_hex: u8,

    /// Check the file but don't real cut it
    #[structopt(short, long = "dry-run")]
    dry_run: bool,
}


fn main() -> io::Result<()> {
    let opt = Opt::from_args();
    let filename = &opt.file;
    let mut f = File::open(filename)?;
    // Length of the file up to and including the last byte that is not the cut byte.
    let mut valid_len = 0;
    // Length of the current run of cut bytes; it is trailing only if the file ends here.
    let mut tmp_len = 0;
    let mut buffer = [0; 4096];

    loop {
        let n = f.read(&mut buffer[..])?;
        if n == 0 { break; }
        // Only the first n bytes of the buffer were filled by this read.
        for &byte in &buffer[..n] {
            if byte == opt.byte_in_hex {
                tmp_len += 1;
            } else {
                // The run was followed by other data, so it is not trailing.
                valid_len += tmp_len + 1;
                tmp_len = 0;
            }
        }
    }
    if !opt.dry_run {
        // Truncate the file in place to drop the trailing run.
        OpenOptions::new().write(true).open(filename)?.set_len(valid_len)?;
    }
    println!("cut {} from {} to {}", filename.display(), valid_len + tmp_len, valid_len);

    Ok(())
}

Answer 3

I tried John1024's sed command. It worked most of the time, but for some large files it did not trim properly. The following always works:

python -c "open('file-stripped.bin', 'wb').write(open('file.bin', 'rb').read().rstrip(b'\0'))"

Note that this loads the whole file into memory first. You can avoid that by writing a proper Python script that processes the file in chunks.
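A minimal sketch of such a chunked script might look like the following. It scans backwards from the end of the file to find the last non-NUL byte, then copies only that many bytes to the output. The function names and the 1 MiB default chunk size are choices made for this example, and it assumes a regular, seekable file.

```python
import os


def stripped_length(path, chunk_size=1 << 20):
    """Return the length of the file once trailing NUL bytes are dropped."""
    with open(path, "rb") as f:
        end = f.seek(0, os.SEEK_END)
        # Walk backwards one chunk at a time until a non-NUL byte shows up.
        while end > 0:
            start = max(0, end - chunk_size)
            f.seek(start)
            kept = f.read(end - start).rstrip(b"\0")
            if kept:
                return start + len(kept)
            end = start
    return 0  # empty file, or nothing but NUL bytes


def strip_trailing_nuls(src, dst, chunk_size=1 << 20):
    """Copy src to dst without its trailing NUL bytes, one chunk at a time."""
    remaining = stripped_length(src, chunk_size)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while remaining > 0:
            chunk = fin.read(min(chunk_size, remaining))
            if not chunk:
                break
            fout.write(chunk)
            remaining -= len(chunk)
```

Calling strip_trailing_nuls('file.bin', 'file-stripped.bin') should produce the same output as the one-liner while holding at most one chunk in memory. Scanning from the end means a file with a long zero tail is read only as far back as its last data byte before the copy starts.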

Answer 4

At least on Linux (and on file systems that support it, such as modern ext4), you can use fallocate -d to replace those sequences of zeros with holes that take up no disk space.

$ echo test > a
$ head -c1G /dev/zero >> a
$ echo test2 >> a
$ head -c1G /dev/zero >> a
$ du -h a
2.1G    a
$ ls -l a
-rw-r--r-- 1 stephane stephane 2147483659 May  5 06:23 a

That 2GiB file takes up 2GiB of disk space.

$ fallocate -d a
$ ls -l a
-rw-r--r-- 1 stephane stephane 2147483659 May  5 06:23 a
$ du -h a
12K     a

It is still the same 2GiB file, but it now takes up only 12KiB of disk space.
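The two numbers that ls -l and du report can also be read from a script. The sketch below is only an illustration: the helper name apparent_and_allocated is invented here, and it assumes a Unix-like system where st_blocks counts 512-byte units, as on Linux.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for path.

    Assumes st_blocks is in 512-byte units, as on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a")
    with open(path, "wb") as f:
        f.write(b"test\n")
        # Extending the file leaves a hole where the file system supports it.
        f.truncate(1 << 20)
    apparent, allocated = apparent_and_allocated(path)
    print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On a file system with sparse-file support, the allocated figure should come out far smaller than the apparent size, which is the same effect fallocate -d produces on an existing file.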

$ filefrag -v a
Filesystem type is: ef53
File size of a is 2147483659 (524289 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    7504727..   7504727:      1:
   1:   262144..  262144:   48424960..  48424960:      1:    7766871: last
a: 2 extents found

You can remove the trailing hole with:

truncate -os 262145 a

The last block should now contain data:

$ tail -c4096 a | hd
00000000  00 00 00 00 00 74 65 73  74 32 0a 00 00 00 00 00  |.....test2......|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

You could also strip the trailing zeros in the last block, but note that doing so will not save any disk space.
