【问题标题】:Why does the Linux kernel have `struct sock` and `struct socket`?为什么Linux内核有`struct sock`和`struct socket`?
【发布时间】:2017-09-19 08:49:06
【问题描述】:

这个问题是网上asked before,但是我没找到好的答案。

Linux 内核网络堆栈具有两种结构:

这两个结构本质上是相互关联的,但生命周期似乎略有不同。可以通过sock->sk 找到sk,或者通过sk->sk_socket 找到sock

为什么有两种结构来存储有关套接字的信息?假设我需要添加一个新字段,我何时将其添加到 struct socket 以及何时添加到 struct sock

更新:请注意,我在 Linux 源代码中的 include/linux/net.h 中引用了 struct socket,这仅用于内核代码,不是 @987654336 @ 用于用户空间。

【问题讨论】:

  • 嗯:有内部和外部。 (类似的事情发生在 struct stat 中)

标签: sockets linux-kernel


【解决方案1】:

struct socket 似乎是用于系统调用的更高级别的接口(这就是为什么它还有指向 struct file 的指针,它在这里表示文件描述符)。

struct sockAF_INET 套接字的内核实现(还有struct unix_sock 用于AF_UNIX 套接字,这是它的派生),它可以被内核和用户空间使用(通过struct sock )。

两者都是在 1993 年被添加到 Linux 1.0 中的,我怀疑你会找到一个说明初始设计决策的文档。

【讨论】:

  • 好吧,但是为什么内核开发人员不直接在struct socket 中包含struct sockstruct unix_sock 中的所有字段?
  • @user1202136:因为它与 decomposition 相矛盾,这使得事情变得更简单。另外,请注意AF_INETAF_UNIX 仅是 43 种(请参阅AF_MAX)可用套接字类型中的 2 种。而struct socketsocket()在创建底层实现结构之前创建的。
【解决方案2】:

“这两个结构本质上是相互关联的”——不确定你的意思。

如果查看这些结构的源文件,我想你会找到答案:

socket  -> linux-src/include/linux/net.h
sock    -> linux-src/include/net/sock.h

套接字

 * NET      An implementation of the SOCKET network access protocol.
 *      This is the master header file for the Linux NET layer,
 *      or, in plain English: the networking handling part of the
 *      kernel.

袜子

 * INET     An implementation of the TCP/IP protocol suite for the LINUX
 *      operating system.  INET is implemented using the  BSD Socket
 *      interface as the means of communication with the user level.

这些结构不同,对套接字抽象的表示也不同。

这里回答不同的套接字。

Unix vs BSD vs TCP vs Internet sockets?

在哪里定义附加字段取决于您的意图。请描述你的任务。

请查看来源。

linux-src/include/linux/net.h

/*
 * NET      An implementation of the SOCKET network access protocol.
 *      This is the master header file for the Linux NET layer,
 *      or, in plain English: the networking handling part of the
 *      kernel.
 *
 * Version: @(#)net.h   1.0.3   05/25/93
 *
 * Authors: Orest Zborowski, <obz@Kodak.COM>
 *      Ross Biro
 *      Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *
 *      This program is free software; you can redistribute it and/or
 *      modify it under the terms of the GNU General Public License
 *      as published by the Free Software Foundation; either version
 *      2 of the License, or (at your option) any later version.
 */
.....
.....
.....
/**
 *  struct socket - general BSD socket
 *  @state: socket state (%SS_CONNECTED, etc)
 *  @type: socket type (%SOCK_STREAM, etc)
 *  @flags: socket flags (%SOCK_NOSPACE, etc)
 *  @ops: protocol specific socket operations
 *  @file: File back pointer for gc
 *  @sk: internal networking protocol agnostic socket representation
 *  @wq: wait queue for several uses
 */
struct socket {
    socket_state        state;

    kmemcheck_bitfield_begin(type);
    short           type;
    kmemcheck_bitfield_end(type);

    unsigned long       flags;

    struct socket_wq __rcu  *wq;

    struct file     *file;
    struct sock     *sk;
    const struct proto_ops  *ops;
};

linux-src/include/net/sock.h

/*
 * INET     An implementation of the TCP/IP protocol suite for the LINUX
 *      operating system.  INET is implemented using the  BSD Socket
 *      interface as the means of communication with the user level.
 *
 *      Definitions for the AF_INET socket handler.
 *
 * Version: @(#)sock.h  1.0.4   05/13/93
 *
 * Authors: Ross Biro
 *      Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *      Corey Minyard <wf-rch!minyard@relay.EU.net>
 *      Florian La Roche <flla@stud.uni-sb.de>
 *
 * Fixes:
 *      Alan Cox    :   Volatiles in skbuff pointers. See
 *                  skbuff comments. May be overdone,
 *                  better to prove they can be removed
 *                  than the reverse.
 *      Alan Cox    :   Added a zapped field for tcp to note
 *                  a socket is reset and must stay shut up
 *      Alan Cox    :   New fields for options
 *  Pauline Middelink   :   identd support
 *      Alan Cox    :   Eliminate low level recv/recvfrom
 *      David S. Miller :   New socket lookup architecture.
 *              Steve Whitehouse:       Default routines for sock_ops
 *              Arnaldo C. Melo :   removed net_pinfo, tp_pinfo and made
 *                          protinfo be just a void pointer, as the
 *                          protocol specific parts were moved to
 *                          respective headers and ipv4/v6, etc now
 *                          use private slabcaches for its socks
 *              Pedro Hortas    :   New flags field for socket options
 *
 *
 *      This program is free software; you can redistribute it and/or
 *      modify it under the terms of the GNU General Public License
 *      as published by the Free Software Foundation; either version
 *      2 of the License, or (at your option) any later version.
 */
....
....
....
/**
  * struct sock - network layer representation of sockets
  * @__sk_common: shared layout with inet_timewait_sock
  * @sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
  * @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
  * @sk_lock:   synchronizer
  * @sk_kern_sock: True if sock is using kernel lock classes
  * @sk_rcvbuf: size of receive buffer in bytes
  * @sk_wq: sock wait queue and async head
  * @sk_rx_dst: receive input route used by early demux
  * @sk_dst_cache: destination cache
  * @sk_dst_pending_confirm: need to confirm neighbour
  * @sk_policy: flow policy
  * @sk_receive_queue: incoming packets
  * @sk_wmem_alloc: transmit queue bytes committed
  * @sk_tsq_flags: TCP Small Queues flags
  * @sk_write_queue: Packet sending queue
  * @sk_omem_alloc: "o" is "option" or "other"
  * @sk_wmem_queued: persistent queue size
  * @sk_forward_alloc: space allocated forward
  * @sk_napi_id: id of the last napi context to receive data for sk
  * @sk_ll_usec: usecs to busypoll when there is no data
  * @sk_allocation: allocation mode
  * @sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
  * @sk_pacing_status: Pacing status (requested, handled by sch_fq)
  * @sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
  * @sk_sndbuf: size of send buffer in bytes
  * @__sk_flags_offset: empty field used to determine location of bitfield
  * @sk_padding: unused element for alignment
  * @sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
  * @sk_no_check_rx: allow zero checksum in RX packets
  * @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
  * @sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
  * @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
  * @sk_gso_max_size: Maximum GSO segment size to build
  * @sk_gso_max_segs: Maximum number of GSO segments
  * @sk_lingertime: %SO_LINGER l_linger setting
  * @sk_backlog: always used with the per-socket spinlock held
  * @sk_callback_lock: used with the callbacks in the end of this struct
  * @sk_error_queue: rarely used
  * @sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
  *           IPV6_ADDRFORM for instance)
  * @sk_err: last error
  * @sk_err_soft: errors that don't cause failure but are the cause of a
  *           persistent failure not just 'timed out'
  * @sk_drops: raw/udp drops counter
  * @sk_ack_backlog: current listen backlog
  * @sk_max_ack_backlog: listen backlog set in listen()
  * @sk_uid: user id of owner
  * @sk_priority: %SO_PRIORITY setting
  * @sk_type: socket type (%SOCK_STREAM, etc)
  * @sk_protocol: which protocol this socket belongs in this network family
  * @sk_peer_pid: &struct pid for this socket's peer
  * @sk_peer_cred: %SO_PEERCRED setting
  * @sk_rcvlowat: %SO_RCVLOWAT setting
  * @sk_rcvtimeo: %SO_RCVTIMEO setting
  * @sk_sndtimeo: %SO_SNDTIMEO setting
  * @sk_txhash: computed flow hash for use on transmit
  * @sk_filter: socket filtering instructions
  * @sk_timer: sock cleanup timer
  * @sk_stamp: time stamp of last packet received
  * @sk_tsflags: SO_TIMESTAMPING socket options
  * @sk_tskey: counter to disambiguate concurrent tstamp requests
  * @sk_zckey: counter to order MSG_ZEROCOPY notifications
  * @sk_socket: Identd and reporting IO signals
  * @sk_user_data: RPC layer private data
  * @sk_frag: cached page frag
  * @sk_peek_off: current peek_offset value
  * @sk_send_head: front of stuff to transmit
  * @sk_security: used by security modules
  * @sk_mark: generic packet mark
  * @sk_cgrp_data: cgroup data for this cgroup
  * @sk_memcg: this socket's memory cgroup association
  * @sk_write_pending: a write to stream socket waits to start
  * @sk_state_change: callback to indicate change in the state of the sock
  * @sk_data_ready: callback to indicate there is data to be processed
  * @sk_write_space: callback to indicate there is bf sending space available
  * @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
  * @sk_backlog_rcv: callback to process the backlog
  * @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
  * @sk_reuseport_cb: reuseport group container
  * @sk_rcu: used during RCU grace period
  */
struct sock {
    /*
     * Now struct inet_timewait_sock also uses sock_common, so please just
     * don't add nothing before this first member (__sk_common) --acme
     */
    struct sock_common  __sk_common;
#define sk_node         __sk_common.skc_node
#define sk_nulls_node       __sk_common.skc_nulls_node
#define sk_refcnt       __sk_common.skc_refcnt
#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping

#define sk_dontcopy_begin   __sk_common.skc_dontcopy_begin
#define sk_dontcopy_end     __sk_common.skc_dontcopy_end
#define sk_hash         __sk_common.skc_hash
#define sk_portpair     __sk_common.skc_portpair
#define sk_num          __sk_common.skc_num
#define sk_dport        __sk_common.skc_dport
#define sk_addrpair     __sk_common.skc_addrpair
#define sk_daddr        __sk_common.skc_daddr
#define sk_rcv_saddr        __sk_common.skc_rcv_saddr
#define sk_family       __sk_common.skc_family
#define sk_state        __sk_common.skc_state
#define sk_reuse        __sk_common.skc_reuse
#define sk_reuseport        __sk_common.skc_reuseport
#define sk_ipv6only     __sk_common.skc_ipv6only
#define sk_net_refcnt       __sk_common.skc_net_refcnt
#define sk_bound_dev_if     __sk_common.skc_bound_dev_if
#define sk_bind_node        __sk_common.skc_bind_node
#define sk_prot         __sk_common.skc_prot
#define sk_net          __sk_common.skc_net
#define sk_v6_daddr     __sk_common.skc_v6_daddr
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
#define sk_cookie       __sk_common.skc_cookie
#define sk_incoming_cpu     __sk_common.skc_incoming_cpu
#define sk_flags        __sk_common.skc_flags
#define sk_rxhash       __sk_common.skc_rxhash

    socket_lock_t       sk_lock;
    atomic_t        sk_drops;
    int         sk_rcvlowat;
    struct sk_buff_head sk_error_queue;
    struct sk_buff_head sk_receive_queue;
    /*
     * The backlog queue is special, it is always used with
     * the per-socket spinlock held and requires low latency
     * access. Therefore we special case it's implementation.
     * Note : rmem_alloc is in this structure to fill a hole
     * on 64bit arches, not because its logically part of
     * backlog.
     */
    struct {
        atomic_t    rmem_alloc;
        int     len;
        struct sk_buff  *head;
        struct sk_buff  *tail;
    } sk_backlog;
#define sk_rmem_alloc sk_backlog.rmem_alloc

    int         sk_forward_alloc;
#ifdef CONFIG_NET_RX_BUSY_POLL
    unsigned int        sk_ll_usec;
    /* ===== mostly read cache line ===== */
    unsigned int        sk_napi_id;
#endif
    int         sk_rcvbuf;

    struct sk_filter __rcu  *sk_filter;
    union {
        struct socket_wq __rcu  *sk_wq;
        struct socket_wq    *sk_wq_raw;
    };
#ifdef CONFIG_XFRM
    struct xfrm_policy __rcu *sk_policy[2];
#endif
    struct dst_entry    *sk_rx_dst;
    struct dst_entry __rcu  *sk_dst_cache;
    atomic_t        sk_omem_alloc;
    int         sk_sndbuf;

    /* ===== cache line for TX ===== */
    int         sk_wmem_queued;
    refcount_t      sk_wmem_alloc;
    unsigned long       sk_tsq_flags;
    struct sk_buff      *sk_send_head;
    struct sk_buff_head sk_write_queue;
    __s32           sk_peek_off;
    int         sk_write_pending;
    __u32           sk_dst_pending_confirm;
    u32         sk_pacing_status; /* see enum sk_pacing */
    long            sk_sndtimeo;
    struct timer_list   sk_timer;
    __u32           sk_priority;
    __u32           sk_mark;
    u32         sk_pacing_rate; /* bytes per second */
    u32         sk_max_pacing_rate;
    struct page_frag    sk_frag;
    netdev_features_t   sk_route_caps;
    netdev_features_t   sk_route_nocaps;
    int         sk_gso_type;
    unsigned int        sk_gso_max_size;
    gfp_t           sk_allocation;
    __u32           sk_txhash;

    /*
     * Because of non atomicity rules, all
     * changes are protected by socket lock.
     */
    unsigned int        __sk_flags_offset[0];
#ifdef __BIG_ENDIAN_BITFIELD
#define SK_FL_PROTO_SHIFT  16
#define SK_FL_PROTO_MASK   0x00ff0000

#define SK_FL_TYPE_SHIFT   0
#define SK_FL_TYPE_MASK    0x0000ffff
#else
#define SK_FL_PROTO_SHIFT  8
#define SK_FL_PROTO_MASK   0x0000ff00

#define SK_FL_TYPE_SHIFT   16
#define SK_FL_TYPE_MASK    0xffff0000
#endif

    kmemcheck_bitfield_begin(flags);
    unsigned int        sk_padding : 1,
                sk_kern_sock : 1,
                sk_no_check_tx : 1,
                sk_no_check_rx : 1,
                sk_userlocks : 4,
                sk_protocol  : 8,
                sk_type      : 16;
#define SK_PROTOCOL_MAX U8_MAX
    kmemcheck_bitfield_end(flags);

    u16         sk_gso_max_segs;
    unsigned long           sk_lingertime;
    struct proto        *sk_prot_creator;
    rwlock_t        sk_callback_lock;
    int         sk_err,
                sk_err_soft;
    u32         sk_ack_backlog;
    u32         sk_max_ack_backlog;
    kuid_t          sk_uid;
    struct pid      *sk_peer_pid;
    const struct cred   *sk_peer_cred;
    long            sk_rcvtimeo;
    ktime_t         sk_stamp;
    u16         sk_tsflags;
    u8          sk_shutdown;
    u32         sk_tskey;
    atomic_t        sk_zckey;
    struct socket       *sk_socket;
    void            *sk_user_data;
#ifdef CONFIG_SECURITY
    void            *sk_security;
#endif
    struct sock_cgroup_data sk_cgrp_data;
    struct mem_cgroup   *sk_memcg;
    void            (*sk_state_change)(struct sock *sk);
    void            (*sk_data_ready)(struct sock *sk);
    void            (*sk_write_space)(struct sock *sk);
    void            (*sk_error_report)(struct sock *sk);
    int         (*sk_backlog_rcv)(struct sock *sk,
                          struct sk_buff *skb);
    void                    (*sk_destruct)(struct sock *sk);
    struct sock_reuseport __rcu *sk_reuseport_cb;
    struct rcu_head     sk_rcu;
};

【讨论】:

  • 复制粘贴不解释的源代码真的没有帮助。我没有具体的任务,只是想了解背后的逻辑。例如,为什么sk_uidstruct sock 中而不是struct socket
猜你喜欢
  • 2017-10-15
  • 2011-11-14
  • 2016-05-16
  • 1970-01-01
  • 2020-07-17
  • 2014-06-17
  • 1970-01-01
  • 2014-10-28
  • 1970-01-01
相关资源
最近更新 更多