sk_buff 定義及其操作 (封包,數據包,packet …)

(出處: 333)

1. sk_buff 結構體
可以看出sk_buff 結構體很重要,
sk_buff — 套接字緩衝區,用來在linux網絡子系統中各層之間數據傳遞,起到了“神經中樞”的作用。
當發送數據包時,linux內核的網絡模塊必須建立一個包含要傳輸的數據包的sk_buff,然後將sk_buff傳遞給下一層,各層在sk_buff 中添加不同的協議頭,直到交給網絡設備發送。同樣,當接收數據包時,網絡設備從物理媒介層接收到數據後,他必須將接收到的數據轉換為sk_buff,並傳遞給上層,各層剝去相應的協議頭後直到交給用戶。
sk_buff結構如下圖所示: (define at include/linux/skbuff.h)

sk_buff定義如​​下:


/** 
 *	struct sk_buff - socket buffer
 *	@next: Next buffer in list
 *	@prev: Previous buffer in list
 *	@tstamp: Time we arrived
 *	@sk: Socket we are owned by
 *	@dev: Device we arrived on/are leaving by
 *	@cb: Control buffer. Free for use by every layer. Put private vars here
 *	@_skb_refdst: destination entry (with norefcount bit)
 *	@sp: the security path, used for xfrm
 *	@len: Length of actual data
 *	@data_len: Data length
 *	@mac_len: Length of link layer header
 *	@hdr_len: writable header length of cloned skb
 *	@csum: Checksum (must include start/offset pair)
 *	@csum_start: Offset from skb->head where checksumming should start
 *	@csum_offset: Offset from csum_start where checksum should be stored
 *	@priority: Packet queueing priority
 *	@local_df: allow local fragmentation
 *	@cloned: Head may be cloned (check refcnt to be sure)
 *	@ip_summed: Driver fed us an IP checksum
 *	@nohdr: Payload reference only, must not modify header
 *	@nfctinfo: Relationship of this skb to the connection
 *	@pkt_type: Packet class
 *	@fclone: skbuff clone status
 *	@ipvs_property: skbuff is owned by ipvs
 *	@peeked: this packet has been seen already, so stats have been
 *		done for it, don't do them again
 *	@nf_trace: netfilter packet trace flag
 *	@protocol: Packet protocol from driver
 *	@destructor: Destruct function
 *	@nfct: Associated connection, if any
 *	@nfct_reasm: netfilter conntrack re-assembly pointer
 *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 *	@skb_iif: ifindex of device we arrived on
 *	@tc_index: Traffic control index
 *	@tc_verd: traffic control verdict
 *	@rxhash: the packet hash computed on receive
 *	@queue_mapping: Queue mapping for multiqueue devices
 *	@ndisc_nodetype: router type (from link layer)
 *	@ooo_okay: allow the mapping of a socket to a queue to be changed
 *	@l4_rxhash: indicate rxhash is a canonical 4-tuple hash over transport
 *		ports.
 *	@wifi_acked_valid: wifi_acked was set
 *	@wifi_acked: whether frame was acked on wifi or not
 *	@no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
 *	@dma_cookie: a cookie to one of several possible DMA operations
 *		done by skb DMA functions
 *	@secmark: security marking
 *	@mark: Generic packet mark
 *	@dropcount: total number of sk_receive_queue overflows
 *	@vlan_tci: vlan tag control information
 *	@transport_header: Transport layer header
 *	@network_header: Network layer header
 *	@mac_header: Link layer header
 *	@tail: Tail pointer
 *	@end: End pointer
 *	@head: Head of buffer
 *	@data: Data head pointer
 *	@truesize: Buffer size
 *	@users: User count - see {datagram,tcp}.c
 */

struct sk_buff {
	/* These two members must be first. */
	struct sk_buff		*next;
	struct sk_buff		*prev;

	ktime_t			tstamp;

	struct sock		*sk;
	struct net_device	*dev;

	/*
	 * This is the control buffer. It is free to use for every
	 * layer. Please put your private variables there. If you
	 * want to keep them across layers you have to do a skb_clone()
	 * first. This is owned by whoever has the skb queued ATM.
	 */
	char			cb[48] __aligned(8);

	unsigned long		_skb_refdst;
#ifdef CONFIG_XFRM
	struct	sec_path	*sp;
#endif
	unsigned int		len,
				data_len;
	__u16			mac_len,
				hdr_len;
	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};
	__u32			priority;
	kmemcheck_bitfield_begin(flags1);
	__u8			local_df:1,
				cloned:1,
				ip_summed:2,
				nohdr:1,
				nfctinfo:3;
	__u8			pkt_type:3,
				fclone:2,
				ipvs_property:1,
				peeked:1,
				nf_trace:1;
	kmemcheck_bitfield_end(flags1);
	__be16			protocol;

	void			(*destructor)(struct sk_buff *skb);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
	struct nf_conntrack	*nfct;
#endif
#ifdef NET_SKBUFF_NF_DEFRAG_NEEDED
	struct sk_buff		*nfct_reasm;
#endif
#ifdef CONFIG_BRIDGE_NETFILTER
	struct nf_bridge_info	*nf_bridge;
#endif

	int			skb_iif;

	__u32			rxhash;

	__u16			vlan_tci;

#ifdef CONFIG_NET_SCHED
	__u16			tc_index;	/* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
	__u16			tc_verd;	/* traffic control verdict */
#endif
#endif

	__u16			queue_mapping;
	kmemcheck_bitfield_begin(flags2);
#ifdef CONFIG_IPV6_NDISC_NODETYPE
	__u8			ndisc_nodetype:2;
#endif
	__u8			ooo_okay:1;
	__u8			l4_rxhash:1;
	__u8			wifi_acked_valid:1;
	__u8			wifi_acked:1;
	__u8			no_fcs:1;
	/* 9/11 bit hole (depending on ndisc_nodetype presence) */
	kmemcheck_bitfield_end(flags2);

#ifdef CONFIG_NET_DMA
	dma_cookie_t		dma_cookie;
#endif
#ifdef CONFIG_NETWORK_SECMARK
	__u32			secmark;
#endif
	union {
		__u32		mark;
		__u32		dropcount;
		__u32		avail_size;
	};

	sk_buff_data_t		transport_header;
	sk_buff_data_t		network_header;
	sk_buff_data_t		mac_header;
	/* These elements must be at the end, see alloc_skb() for details.  */
	sk_buff_data_t		tail;
	sk_buff_data_t		end;
	unsigned char		*head,
				*data;
	unsigned int		truesize;
	atomic_t		users;
};


sk_buff主要成員如下:
1.1 各層協議頭:
— transport_header : 傳輸層協議頭,如TCP, UDP , ICMP, IGMP等協議頭
— network_header : 網絡層協議頭, 如IP, IPv6, ARP 協議頭
— mac_header : 鏈路層協議頭。
— sk_buff_data_t 原型就是一個char 指針

#ifdef NET_SKBUFF_DATA_USES_OFFSET
typedef unsigned int sk_buff_data_t;
#else
typedef unsigned char *sk_buff_data_t;
#endif

1.2 數據緩衝區指針head, data, tail, end
— *head :指向內存中已分配的用於存放網絡數據緩衝區的起始地址, sk_buff和相關數據被分配後,該指針值就固定了
— *data : 指向對應當前協議層有效數據的起始地址。
每個協議層的有效數據內容不一樣,各層有效數據的內容如下:
a. 對於傳輸層,有效數據包括用戶數據和傳輸層協議頭
b. 對於網絡層,有效數據包括用戶數據、傳輸層協議和網絡層協議頭。
c. 對於數據鏈路層,有效數據包括用戶數據、傳輸層協議、網絡層協議和鏈路層協議。
因此,data指針隨著當前擁有sk_buff的協議層的變化而進行相應的移動。
— tail :指向對應當前協議層有效數據的結尾地址,與data指針相對應。
— end :指向內存中分配的網絡數據緩衝區的結尾,與head指針相對應。和head一樣,sk_buff被分配後,end指針就固定了。
head, data, tail, end 關係如下圖所示:

1.3 長度信息len, data_len, truesize
— len :指網絡數據包的有效數據的長度,包括協議頭和負載(payload).
— data_len : 記錄分片的數據長度
— truesize :表述緩存區的整體長度,一般為sizeof(sk_buff).
1.4 數據包類型
— pkt_type :指定數據包類型。驅動程序負責將其設置為:
PACKET_HOST — 該數據包是給我的。
PACKET_OTHERHOST — 該數據包不是給我的。
PACKET_BROADCAST — 廣播類型的數據包
PACKET_MULTICAST — 組播類型的數據包
驅動程序不必顯式的修改pkt_type,因為eth_type_trans會完成該工作。
2. 套接字緩衝區的操作
2.1 分配套接字緩衝區
struct sk_buff *alloc_skb(unsigned intlen, int priority);
alloc_skb()函數分配一個套接字緩衝區和一個數據緩衝區。
— len : 為數據緩衝區的大小
— priority : 內存分配的優先級

static inline struct sk_buff *alloc_skb(unsigned int size,
					gfp_t priority)
{
	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
}

/**
 *	__alloc_skb	-	allocate a network buffer
 *	@size: size to allocate
 *	@gfp_mask: allocation mask
 *	@fclone: allocate from fclone cache instead of head cache
 *		and allocate a cloned (child) skb
 *	@node: numa node to allocate memory on
 *
 *	Allocate a new &sk_buff. The returned buffer has no headroom and a
 *	tail room of size bytes. The object has a reference count of one.
 *	The return is the buffer. On a failure the return is %NULL.
 *
 *	Buffers may only be allocated from interrupts using a @gfp_mask of
 *	%GFP_ATOMIC.
 */
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
			    int fclone, int node)
{
	struct kmem_cache *cache;
	struct skb_shared_info *shinfo;
	struct sk_buff *skb;
	u8 *data;

	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

	/* Get the HEAD */
	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
	if (!skb)
		goto out;
	prefetchw(skb);

	/* We do our best to align skb_shared_info on a separate cache
	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
	 * Both skb->head and skb_shared_info are cache line aligned.
	 */
	size = SKB_DATA_ALIGN(size);
	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	data = kmalloc_node_track_caller(size, gfp_mask, node);
	if (unlikely(ZERO_OR_NULL_PTR(data)))
		goto nodata;
	/* kmalloc(size) might give us more room than requested.
	 * Put skb_shared_info exactly at the end of allocated zone,
	 * to allow max possible filling before reallocation.
	 */
	size = SKB_WITH_OVERHEAD(ksize(data));
	prefetchw(data + size);

	/*
	 * Only clear those fields we need to clear, not those that we will
	 * actually initialise below. Hence, don't put any more fields after
	 * the tail pointer in struct sk_buff!
	 */
	memset(skb, 0, offsetof(struct sk_buff, tail));
	/* Account for allocated memory : skb + skb->head */
	skb->truesize = SKB_TRUESIZE(size);
	atomic_set(&skb->users, 1);
	skb->head = data;
	skb->data = data;
	skb_reset_tail_pointer(skb);
	skb->end = skb->tail + size;
#ifdef NET_SKBUFF_DATA_USES_OFFSET
	skb->mac_header = ~0U;
#endif

	/* make sure we initialize shinfo sequentially */
	shinfo = skb_shinfo(skb);
	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
	atomic_set(&shinfo->dataref, 1);
	kmemcheck_annotate_variable(shinfo->destructor_arg);

	if (fclone) {
		struct sk_buff *child = skb + 1;
		atomic_t *fclone_ref = (atomic_t *) (child + 1);

		kmemcheck_annotate_bitfield(child, flags1);
		kmemcheck_annotate_bitfield(child, flags2);
		skb->fclone = SKB_FCLONE_ORIG;
		atomic_set(fclone_ref, 1);

		child->fclone = SKB_FCLONE_UNAVAILABLE;
	}
out:
	return skb;
nodata:
	kmem_cache_free(cache, skb);
	skb = NULL;
	goto out;
}
EXPORT_SYMBOL(__alloc_skb);

struct sk_buff *dev_alloc_skb(unsignedint len​​);
dev_alloc_skb()函數以GFP_ATOMIC 優先級調用上面的alloc_skb()函數。
並保存skb->dead 和skb->data之間的16個字節

/**
 *	dev_alloc_skb - allocate an skbuff for receiving
 *	@length: length to allocate
 *
 *	Allocate a new &sk_buff and assign it a usage count of one. The
 *	buffer has unspecified headroom built in. Users should allocate
 *	the headroom they think they need without accounting for the
 *	built in space. The built in space is used for optimisations.
 *
 *	%NULL is returned if there is no free memory. Although this function
 *	allocates memory it can be called from an interrupt.
 */
struct sk_buff *dev_alloc_skb(unsigned int length)
{
	/*
	 * There is more code here than it seems:
	 * __dev_alloc_skb is an inline
	 */
	return __dev_alloc_skb(length, GFP_ATOMIC);
}
EXPORT_SYMBOL(dev_alloc_skb);
/**
 *	__dev_alloc_skb - allocate an skbuff for receiving
 *	@length: length to allocate
 *	@gfp_mask: get_free_pages mask, passed to alloc_skb
 *
 *	Allocate a new &sk_buff and assign it a usage count of one. The
 *	buffer has unspecified headroom built in. Users should allocate
 *	the headroom they think they need without accounting for the
 *	built in space. The built in space is used for optimisations.
 *
 *	%NULL is returned if there is no free memory.
 */
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
					      gfp_t gfp_mask)
{
	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
	if (likely(skb))
		skb_reserve(skb, NET_SKB_PAD);
	return skb;
}

2.2 釋放套接字緩衝區
void kfree_skb(struct sk_buff *skb);

/**
 *	kfree_skb - free an sk_buff
 *	@skb: buffer to free
 *
 *	Drop a reference to the buffer and free it if the usage count has
 *	hit zero.
 */
void kfree_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_kfree_skb(skb, __builtin_return_address(0));
	__kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);

— kfree_skb() 函數只能在內核內部使用,網絡設備驅動中必須使用dev_kfree_skb()、dev_kfree_skb_irq() 或dev_kfree_skb_any().
void dev_kfree_skb(struct sk_buff *skb);
— dev_kfree_skb()用於非中斷上下文。

#define dev_kfree_skb(a)	consume_skb(a)

/**
 *	consume_skb - free an skbuff
 *	@skb: buffer to free
 *
 *	Drop a ref to the buffer and free it if the usage count has hit zero
 *	Functions identically to kfree_skb, but kfree_skb assumes that the frame
 *	is being dropped after a failure and notes that
 */
void consume_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_consume_skb(skb);
	__kfree_skb(skb);
}
EXPORT_SYMBOL(consume_skb);

void dev_kfree_skb_irq(struct sk_buff *skb);
— dev_kfree_skb_irq() 用於中斷上下文。

void dev_kfree_skb_irq(struct sk_buff *skb)
{
	if (atomic_dec_and_test(&skb->users)) {
		struct softnet_data *sd;
		unsigned long flags;

		local_irq_save(flags);
		sd = &__get_cpu_var(softnet_data);
		skb->next = sd->completion_queue;
		sd->completion_queue = skb;
		raise_softirq_irqoff(NET_TX_SOFTIRQ);
		local_irq_restore(flags);
	}
}
EXPORT_SYMBOL(dev_kfree_skb_irq);

void dev_kfree_skb_any(struct sk_buff *skb);
— dev_kfree_skb_any() 在中斷或非中斷上下文中都能使用。

void dev_kfree_skb_any(struct sk_buff *skb)
{
	if (in_irq() || irqs_disabled())
		dev_kfree_skb_irq(skb);
	else
		dev_kfree_skb(skb);
}
EXPORT_SYMBOL(dev_kfree_skb_any);

2.3移動指針
Linux套接字緩衝區中的指針移動操作有:put(放置), push(推), pull(拉)和reserve(保留)等。
2.3.1 put操作
unsigned char *skb_put(struct sk_buff *skb, unsigned int len​​);
將tail 指針下移,增加sk_buff 的len 值,並返回skb->tail 的當前值。
將數據添加在buffer的尾部。


/**
 *	skb_put - add data to a buffer
 *	@skb: buffer to use
 *	@len: amount of data to add
 *
 *	This function extends the used data area of the buffer. If this would
 *	exceed the total buffer size the kernel will panic. A pointer to the
 *	first byte of the extra data is returned.
 */
unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb_tail_pointer(skb);
	SKB_LINEAR_ASSERT(skb);
	skb->tail += len;
	skb->len  += len;
	if (unlikely(skb->tail > skb->end))
		skb_over_panic(skb, len, __builtin_return_address(0));
	return tmp;
}
EXPORT_SYMBOL(skb_put);
static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
{
	return skb->tail;
}

unsigned char *__skb_put(struct sk_buff *skb, unsigned int len​​);
__skb_put() 與skb_put()的區別在於skb_put()會檢測放入緩衝區的數據, 而__skb_put()不會檢查

static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb_tail_pointer(skb);
	SKB_LINEAR_ASSERT(skb);
	skb->tail += len;
	skb->len  += len;
	return tmp;
}

2.3.2 push操作:
unsigned char *skb_push(struct sk_buff *skb, unsigned int len​​);
skb_push()會將data指針上移,也就是將數據添加在buffer的起始點,因此也要增加sk_buff的len值。


/**
 *	skb_push - add data to the start of a buffer
 *	@skb: buffer to use
 *	@len: amount of data to add
 *
 *	This function extends the used data area of the buffer at the buffer
 *	start. If this would exceed the total buffer headroom the kernel will
 *	panic. A pointer to the first byte of the extra data is returned.
 */
unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	if (unlikely(skb->datahead))
		skb_under_panic(skb, len, __builtin_return_address(0));
	return skb->data;
}
EXPORT_SYMBOL(skb_push);

unsigned char *__skb_push(struct sk_buff *skb, unsigned int len​​);

static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	return skb->data;
}

__skb_push()和skb_push()的區別與__skb_put() 和skb_put()的區別一樣。
push操作在緩衝區的頭部增加一段可以存儲網絡數據包的空間,而put操作在緩衝區的尾部增加一段可以存儲網絡數據包的空間。

2.3.3 pull操作:
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len​​);
skb_pull()將data指針下移,並減少skb的len值, 這個操作與skb_push()對應。
這個操作主要用於下層協議向上層協議移交數據包,使data指針指向上一層協議頭


/**
 *	skb_pull - remove data from the start of a buffer
 *	@skb: buffer to use
 *	@len: amount of data to remove
 *
 *	This function removes data from the start of a buffer, returning
 *	the memory to the headroom. A pointer to the next data in the buffer
 *	is returned. Once the data has been pulled future pushes will overwrite
 *	the old data.
 */
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
{
	return skb_pull_inline(skb, len);
}
EXPORT_SYMBOL(skb_pull);

static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
{
	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
}
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
	skb->len -= len;
	BUG_ON(skb->len < skb->data_len);
	return skb->data += len;
}

2.3.4 reserve 操作
void skb_reserve(struct sk_buff *skb, unsigned int len​​);
skb_reserve()將data指針和tail 指針同時下移。
這個操作用於在緩衝區頭部預留len長度的空間

/**
 *	skb_reserve - adjust headroom
 *	@skb: buffer to alter
 *	@len: bytes to move
 *
 *	Increase the headroom of an empty &sk_buff by reducing the tail
 *	room. This is only allowed for an empty buffer.
 */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
	skb->data += len;
	skb->tail += len;
}

3. 例子:
Linux處理一個UDP數據包的接收流程,來說明對sk_buff的操作過程。
這一過程絕大部分工作會在內核完成,驅動中只需要完成涉及數據鏈路層部分。
假設網卡收到一個UDP數據包,Linux處理流程如下:

3.1 網卡收到一個UDP數據包後,驅動程序需要創建一個sk_buff結構體和數據緩衝區,將接收到的數據全部複製到data指向的空間,並將skb->mac_header指向data。
此時有效數據的開始位置data是一個以太網頭部,即鏈路層協議頭。
示例代碼如下:
//分配新的套接字緩衝區和數據緩衝區

工作內容如下圖所示:

3.2 數據鏈路層通過調用skb_pull() 剝掉以太網協議頭,向網絡層IP傳送數據包。
在剝離過程中,data指針會下移一個以太網頭部的長度sizeof(struct ethhdr), 而len 也減去sizeof(struct ethhdr)長度。
此時有效數據的開始位置是一個IP協議頭,skb->network_head指向data,即IP協議頭, 而skb->mac_header 依舊指向以太網頭, 即鏈路層協議頭。
內容如下圖所示:

3.3 網絡層通過skb_pull()剝掉IP協議頭,向UDP傳輸層傳遞數據包。
剝離過程中,data指針會下移一個IP協議頭長度sizeof(struct iphdr), 而len也會減少sizeof(struct iphdr)長度。
此時有效數據開始位置是一個UDP協議頭, skb->transport_header指向data,即UDP協議頭。
而skb->network_header繼續指向IP協議頭, skb->mac_header 繼續指向鏈路層協議頭。
如下圖所示:

3.4 應用程序在調用recv() 接收數據時,從skb->data + sizeof(struct udphdr) 的位置開始復製到應用層緩衝區。
可見,UPD協議頭到最後也沒有被剝離。

發表迴響