How is the atomicity of writev() guaranteed?

Question 1

I have a multi-threaded Linux x86_64 user program that writes to an SCTP socket using the writev() system call. I want to verify the atomicity of the writev() system call.

The writev() man page states the following:

ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

The data transfers performed by readv() and writev() are atomic: the data written by writev()
is written as a single block that is not intermingled with output from writes in other processes
(but see pipe(7) for an exception); analogously, readv() is guaranteed to read a contiguous
block of data from the file, regardless of read operations performed in other threads or processes
that have file descriptors referring to the same open file description (see open(2)).
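
For readers unfamiliar with the call, here is a minimal sketch of a gathered write. The helper name write_record() and the header/body split are assumptions made for illustration; they are not taken from the program described in this question.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write a header and a body with one writev() call.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_record(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2];

    assert(hdr != NULL && body != NULL);
    iov[0].iov_base = (void *)hdr;    /* writev() only reads from the buffers */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    /* Both buffers are handed to the kernel in a single system call. */
    return writev(fd, iov, 2);
}
```

The point of the man page text is that the two buffers should arrive as one contiguous block rather than as two separate writes that another writer could slip between.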

So I expected to find an obvious lock in the writev() implementation. When I did not see one there, I started tracing the calls. Here is what I found. This is my first time reading the Linux kernel source, so please forgive any misunderstandings.

The Linux kernel analyzed is 4.4.0 on x86.

The writev() implementation starts at fs/read_write.c:896.

SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec, unsigned long, vlen)

It calls vfs_writev(), defined in the same file at fs/read_write.c:863.

ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
           unsigned long vlen, loff_t *pos)
{
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE))
        return -EINVAL;

    return do_readv_writev(WRITE, file, vec, vlen, pos);
}

do_readv_writev() is also in the same file, at fs/read_write.c:798, and for the WRITE type it executes:

fn = (io_fn_t)file->f_op->write;
iter_fn = file->f_op->write_iter;
file_start_write(file);

file_start_write() is an inline function at include/linux/fs.h:2512.

static inline void file_start_write(struct file *file)
{
    if (!S_ISREG(file_inode(file)->i_mode))
        return;
    __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}

S_ISREG() is defined at include/uapi/linux/stat.h:20 and is used to check whether the descriptor refers to a regular file.
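
The same S_ISREG() test can be run from user space with fstat(). The sketch below uses two hypothetical helpers, fd_is_regular() and socket_is_regular(), and an AF_UNIX socket pair standing in for the SCTP socket. It suggests that, for a socket, file_start_write() takes the early return and never reaches __sb_start_write().

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if fd refers to a regular file, 0 if it does not,
 * and -1 if fstat() fails. */
int fd_is_regular(int fd)
{
    struct stat st;

    assert(fd >= 0);
    if (fstat(fd, &st) != 0)
        return -1;
    return S_ISREG(st.st_mode) ? 1 : 0;
}

/* Create a throwaway socket pair and report whether the kernel
 * describes one end as a regular file. */
int socket_is_regular(void)
{
    int sv[2], r;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    r = fd_is_regular(sv[0]);
    close(sv[0]);
    close(sv[1]);
    return r;
}
```

A socket has file type S_IFSOCK, so S_ISREG() is false for it and the freeze-protection path below only matters for regular files.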

__sb_start_write() is defined at fs/super.c:1252.

/*
 * This is an internal function, please use sb_start_{write,pagefault,intwrite}
 * instead.
 */
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
    bool force_trylock = false;
    int ret = 1;

#ifdef CONFIG_LOCKDEP
    /*
     * We want lockdep to tell us about possible deadlocks with freezing
     * but it's it bit tricky to properly instrument it. Getting a freeze
     * protection works as getting a read lock but there are subtle
     * problems. XFS for example gets freeze protection on internal level
     * twice in some cases, which is OK only because we already hold a
     * freeze protection also on higher level. Due to these cases we have
     * to use wait == F (trylock mode) which must not fail.
     */
    if (wait) {
        int i;

        for (i = 0; i < level - 1; i++)
            if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
                force_trylock = true;
                break;
            }
    }
#endif
    if (wait && !force_trylock)
        percpu_down_read(sb->s_writers.rw_sem + level-1);
    else
        ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);

    WARN_ON(force_trylock & !ret);
    return ret;
}
EXPORT_SYMBOL(__sb_start_write);

Based on this, I don't believe my kernel was compiled with CONFIG_LOCKDEP.

The filesystem locking is described in the comment starting at fs/super.c:1322.

/**
 * freeze_super - lock the filesystem and force it into a consistent state
 * @sb: the super to lock
 *
 * Syncs the super to make sure the filesystem is consistent and calls the fs's
 * freeze_fs.  Subsequent calls to this without first thawing the fs will return
 * -EBUSY.
 *
 * During this function, sb->s_writers.frozen goes through these values:
 *
 * SB_UNFROZEN: File system is normal, all writes progress as usual.
 *
 * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
 * writes should be blocked, though page faults are still allowed. We wait for
 * all writes to complete and then proceed to the next stage.
 *
 * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
 * but internal fs threads can still modify the filesystem (although they
 * should not dirty new pages or inodes), writeback can run etc. After waiting
 * for all running page faults we sync the filesystem which will clean all
 * dirty pages and inodes (no new dirty pages or inodes can be created when
 * sync is running).
 *
 * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
 * modification are blocked (e.g. XFS preallocation truncation on inode
 * reclaim). This is usually implemented by blocking new transactions for
 * filesystems that have them and need this additional guard. After all
 * internal writers are finished we call ->freeze_fs() to finish filesystem
 * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
 * mostly auxiliary for filesystems to verify they do not modify frozen fs.
 *
 * sb->s_writers.frozen is protected by sb->s_umount.
 */

Finally, at kernel/locking/percpu-rwsem.c:70:

/*
 * Like the normal down_read() this is not recursive, the writer can
 * come after the first percpu_down_read() and create the deadlock.
 *
 * Note: returns with lock_is_held(brw->rw_sem) == T for lockdep,
 * percpu_up_read() does rwsem_release(). This pairs with the usage
 * of ->rw_sem in percpu_down/up_write().
 */
void percpu_down_read(struct percpu_rw_semaphore *brw)
{
    might_sleep();
    rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);

    if (likely(update_fast_ctr(brw, +1)))
        return;

    /* Avoid rwsem_acquire_read() and rwsem_release() */
    __down_read(&brw->rw_sem);
    atomic_inc(&brw->slow_read_ctr);
    __up_read(&brw->rw_sem);
}
EXPORT_SYMBOL_GPL(percpu_down_read);

So, this is the lock.

Answer 1

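Locking and atomicity are separate concepts. Locks are used to guarantee mutual exclusion among threads that access shared data. Atomicity guarantees that an operation is performed in an all-or-nothing fashion.

As C6Up1bQ73STi29cA mentioned, the atomicity of writev() is guaranteed by preempt_disable(). In fact, the VFS layer does not guarantee mutual exclusion for writev(). Instead, the filesystem (or one of the generic_file* functions, if the filesystem uses the generic layer) has to handle multiple writev() calls that write to the same part of a file.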

Answer 2

By the way, the handling of writev() is no more special than the handling of write().

It does not guarantee atomicity for every type of file. Look up PIPE_BUF: if you write more than this amount to a pipe, the data can be interleaved with other writes.
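
As a rough sketch, a program can ask for the pipe's limit with fpathconf() and compare it with the total size of the iovec array before relying on a non-interleaved write. The helper names iov_total() and fits_in_pipe_buf() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sum of the lengths of all buffers in the array. */
size_t iov_total(const struct iovec *iov, int iovcnt)
{
    size_t total = 0;

    assert(iov != NULL && iovcnt >= 0);
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    return total;
}

/* Returns 1 if a writev() of this size fits within PIPE_BUF for
 * pipe_fd, 0 if it is larger, -1 if the limit cannot be determined. */
int fits_in_pipe_buf(int pipe_fd, const struct iovec *iov, int iovcnt)
{
    long limit = fpathconf(pipe_fd, _PC_PIPE_BUF);

    if (limit < 0)
        return -1;
    return iov_total(iov, iovcnt) <= (size_t)limit ? 1 : 0;
}
```

A result of 0 means the write is large enough that, per pipe(7), it may be interleaved with writes from other processes.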

For regular files, which are affected by the current f_pos, see f_pos_lock. It handles this case as an atomic read and update of f_pos, followed by a pwritev().

This protection is a relatively recent "fix" (2014). Before that, Linux violated POSIX and "nobody cared". If your Linux program relies on this guarantee, you are probably doing something rather unusual :).
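
The "atomic read and update of f_pos, followed by pwritev()" idea can be modelled in user space. This is an analogy and not the kernel's code: struct shared_file, shared_file_open() and shared_writev() are hypothetical names, and a pthread mutex plays the part of f_pos_lock.

```c
#define _DEFAULT_SOURCE          /* exposes pwritev() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Stand-in for an open file description shared by several threads. */
struct shared_file {
    int fd;
    off_t pos;                   /* plays the role of f_pos      */
    pthread_mutex_t pos_lock;    /* plays the role of f_pos_lock */
};

/* Open (and truncate) path for reading and writing, position 0.
 * Returns 0 on success, -1 on failure. */
int shared_file_open(struct shared_file *sf, const char *path)
{
    assert(sf != NULL && path != NULL);
    sf->fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (sf->fd < 0)
        return -1;
    sf->pos = 0;
    return pthread_mutex_init(&sf->pos_lock, NULL) == 0 ? 0 : -1;
}

/* Read the position, write at it, and advance it, all under the lock. */
ssize_t shared_writev(struct shared_file *sf, const struct iovec *iov,
                      int iovcnt)
{
    ssize_t n;

    assert(sf != NULL && iov != NULL);
    pthread_mutex_lock(&sf->pos_lock);
    n = pwritev(sf->fd, iov, iovcnt, sf->pos);  /* positional write */
    if (n > 0)
        sf->pos += n;                           /* publish new position */
    pthread_mutex_unlock(&sf->pos_lock);
    return n;
}
```

Because the position is read and advanced while the lock is held, two callers cannot both start at the same offset and overwrite each other's data.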

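POSIX does not appear to give any such guarantee for sockets. Judging from the mailing list discussion, Linux may also provide this guarantee for seekable device files. I am not sure whether you can get it for something non-seekable, such as a tty.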
