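(Source: 333)
1. The sk_buff structure
sk_buff, the socket buffer, is the central data structure of the Linux networking subsystem. It carries packet data between the layers of the stack and acts as the stack's "nerve center".
When a packet is sent, the kernel networking code must build an sk_buff containing the data to transmit and hand it to the next layer down. Each layer adds its own protocol header to the sk_buff until the packet reaches the network device for transmission. Likewise, when a packet is received, the network device driver must wrap the data it receives from the physical medium in an sk_buff and pass it up the stack. Each layer strips its own protocol header until the payload is delivered to the user.
The layout of sk_buff is shown in the figure below (the structure is defined in include/linux/skbuff.h).
sk_buff is defined as follows: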
    /**
     * struct sk_buff - socket buffer
     * @next: Next buffer in list
     * @prev: Previous buffer in list
     * @tstamp: Time we arrived
     * @sk: Socket we are owned by
     * @dev: Device we arrived on/are leaving by
     * @cb: Control buffer. Free for use by every layer. Put private vars here
     * @_skb_refdst: destination entry (with norefcount bit)
     * @sp: the security path, used for xfrm
     * @len: Length of actual data
     * @data_len: Data length
     * @mac_len: Length of link layer header
     * @hdr_len: writable header length of cloned skb
     * @csum: Checksum (must include start/offset pair)
     * @csum_start: Offset from skb->head where checksumming should start
     * @csum_offset: Offset from csum_start where checksum should be stored
     * @priority: Packet queueing priority
     * @local_df: allow local fragmentation
     * @cloned: Head may be cloned (check refcnt to be sure)
     * @ip_summed: Driver fed us an IP checksum
     * @nohdr: Payload reference only, must not modify header
     * @nfctinfo: Relationship of this skb to the connection
     * @pkt_type: Packet class
     * @fclone: skbuff clone status
     * @ipvs_property: skbuff is owned by ipvs
     * @peeked: this packet has been seen already, so stats have been
     *     done for it, don't do them again
     * @nf_trace: netfilter packet trace flag
     * @protocol: Packet protocol from driver
     * @destructor: Destruct function
     * @nfct: Associated connection, if any
     * @nfct_reasm: netfilter conntrack re-assembly pointer
     * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
     * @skb_iif: ifindex of device we arrived on
     * @tc_index: Traffic control index
     * @tc_verd: traffic control verdict
     * @rxhash: the packet hash computed on receive
     * @queue_mapping: Queue mapping for multiqueue devices
     * @ndisc_nodetype: router type (from link layer)
     * @ooo_okay: allow the mapping of a socket to a queue to be changed
     * @l4_rxhash: indicate rxhash is a canonical 4-tuple hash over transport
     *     ports.
     * @wifi_acked_valid: wifi_acked was set
     * @wifi_acked: whether frame was acked on wifi or not
     * @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
     * @dma_cookie: a cookie to one of several possible DMA operations
     *     done by skb DMA functions
     * @secmark: security marking
     * @mark: Generic packet mark
     * @dropcount: total number of sk_receive_queue overflows
     * @vlan_tci: vlan tag control information
     * @transport_header: Transport layer header
     * @network_header: Network layer header
     * @mac_header: Link layer header
     * @tail: Tail pointer
     * @end: End pointer
     * @head: Head of buffer
     * @data: Data head pointer
     * @truesize: Buffer size
     * @users: User count - see {datagram,tcp}.c
     */
    struct sk_buff {
        /* These two members must be first. */
        struct sk_buff      *next;
        struct sk_buff      *prev;

        ktime_t             tstamp;

        struct sock         *sk;
        struct net_device   *dev;

        /*
         * This is the control buffer. It is free to use for every
         * layer. Please put your private variables there. If you
         * want to keep them across layers you have to do a skb_clone()
         * first. This is owned by whoever has the skb queued ATM.
         */
        char                cb[48] __aligned(8);

        unsigned long       _skb_refdst;
    #ifdef CONFIG_XFRM
        struct sec_path     *sp;
    #endif
        unsigned int        len,
                            data_len;
        __u16               mac_len,
                            hdr_len;
        union {
            __wsum          csum;
            struct {
                __u16       csum_start;
                __u16       csum_offset;
            };
        };
        __u32               priority;
        kmemcheck_bitfield_begin(flags1);
        __u8                local_df:1,
                            cloned:1,
                            ip_summed:2,
                            nohdr:1,
                            nfctinfo:3;
        __u8                pkt_type:3,
                            fclone:2,
                            ipvs_property:1,
                            peeked:1,
                            nf_trace:1;
        kmemcheck_bitfield_end(flags1);
        __be16              protocol;

        void                (*destructor)(struct sk_buff *skb);
    #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
        struct nf_conntrack *nfct;
    #endif
    #ifdef NET_SKBUFF_NF_DEFRAG_NEEDED
        struct sk_buff      *nfct_reasm;
    #endif
    #ifdef CONFIG_BRIDGE_NETFILTER
        struct nf_bridge_info *nf_bridge;
    #endif

        int                 skb_iif;

        __u32               rxhash;

        __u16               vlan_tci;

    #ifdef CONFIG_NET_SCHED
        __u16               tc_index;   /* traffic control index */
    #ifdef CONFIG_NET_CLS_ACT
        __u16               tc_verd;    /* traffic control verdict */
    #endif
    #endif

        __u16               queue_mapping;
        kmemcheck_bitfield_begin(flags2);
    #ifdef CONFIG_IPV6_NDISC_NODETYPE
        __u8                ndisc_nodetype:2;
    #endif
        __u8                ooo_okay:1;
        __u8                l4_rxhash:1;
        __u8                wifi_acked_valid:1;
        __u8                wifi_acked:1;
        __u8                no_fcs:1;
        /* 9/11 bit hole (depending on ndisc_nodetype presence) */
        kmemcheck_bitfield_end(flags2);

    #ifdef CONFIG_NET_DMA
        dma_cookie_t        dma_cookie;
    #endif
    #ifdef CONFIG_NETWORK_SECMARK
        __u32               secmark;
    #endif
        union {
            __u32           mark;
            __u32           dropcount;
            __u32           avail_size;
        };

        sk_buff_data_t      transport_header;
        sk_buff_data_t      network_header;
        sk_buff_data_t      mac_header;
        /* These elements must be at the end, see alloc_skb() for details. */
        sk_buff_data_t      tail;
        sk_buff_data_t      end;
        unsigned char       *head,
                            *data;
        unsigned int        truesize;
        atomic_t            users;
    };
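The main members of sk_buff are:
1.1 Protocol headers of each layer:
- transport_header: the transport-layer header, e.g. a TCP, UDP, ICMP, or IGMP header
- network_header: the network-layer header, e.g. an IP, IPv6, or ARP header
- mac_header: the link-layer header.
- sk_buff_data_t is not always a pointer. When NET_SKBUFF_DATA_USES_OFFSET is defined it is an unsigned int offset from skb->head; otherwise it is an unsigned char pointer: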
    #ifdef NET_SKBUFF_DATA_USES_OFFSET
    typedef unsigned int sk_buff_data_t;
    #else
    typedef unsigned char *sk_buff_data_t;
    #endif
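The two variants of sk_buff_data_t differ only in how a header position is stored. The following is a minimal userspace sketch of the offset variant, not kernel code: the names mini_skb, set_header, and header_ptr are made up for illustration. It stores a header position as an offset from head and resolves it back to a pointer on demand.

```c
#include <stddef.h>

/* Hypothetical userspace model: a header position stored as an offset. */
struct mini_skb {
    unsigned char *head;        /* start of the data buffer */
    unsigned int   net_offset;  /* network header position, relative to head */
};

/* Record where the network header starts, as an offset from head. */
static void set_header(struct mini_skb *skb, unsigned char *hdr)
{
    skb->net_offset = (unsigned int)(hdr - skb->head);
}

/* Resolve the stored offset back into a usable pointer. */
static unsigned char *header_ptr(const struct mini_skb *skb)
{
    return skb->head + skb->net_offset;
}
```

A stored offset stays meaningful if the data buffer is moved to a different address, and on typical 64-bit machines an unsigned int takes less space than a pointer.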
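1.2 Data buffer pointers: head, data, tail, end
- head: points to the start of the memory allocated for the packet data buffer. Once the sk_buff and its data buffer have been allocated, this pointer does not change.
- data: points to the start of the valid data for the protocol layer that currently owns the buffer.
What counts as valid data differs from layer to layer:
a. For the transport layer, the valid data is the user data plus the transport-layer header.
b. For the network layer, the valid data is the user data, the transport-layer header, and the network-layer header.
c. For the data link layer, the valid data is the user data plus the transport-layer, network-layer, and link-layer headers.
The data pointer therefore moves as ownership of the sk_buff passes from one protocol layer to the next.
- tail: points to the end of the valid data for the current protocol layer; it is the counterpart of data.
- end: points to the end of the allocated data buffer; it is the counterpart of head. Like head, it is fixed once the sk_buff has been allocated.
The relationship between head, data, tail, and end is shown in the figure below: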
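In terms of these four pointers, the buffer splits into three regions: headroom, valid data, and tailroom. The sketch below is a userspace model with a made-up mini_skb type and helper names, not the kernel's implementation; it only shows how the three sizes fall out of pointer differences.

```c
#include <stddef.h>

/* Hypothetical userspace model of the four buffer pointers. */
struct mini_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid data */
    unsigned char *tail;  /* end of the valid data */
    unsigned char *end;   /* end of the allocated buffer */
};

/* Free space in front of the valid data (room for skb_push-style growth). */
static size_t headroom(const struct mini_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* Bytes of valid data held between data and tail. */
static size_t linear_len(const struct mini_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}

/* Free space behind the valid data (room for skb_put-style growth). */
static size_t tailroom(const struct mini_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}
```

The kernel provides skb_headroom() and skb_tailroom() for the same purpose.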
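1.3 Length information: len, data_len, truesize
- len: the length of the valid data in the packet, including protocol headers and payload.
- data_len: the length of the data held in fragments, i.e. the part of len that is not in the linear area between data and tail.
- truesize: the total memory accounted to the buffer. It covers the data buffer as well as the sk_buff structure itself, so it is larger than sizeof(struct sk_buff); __alloc_skb() sets it with SKB_TRUESIZE(size).
1.4 Packet type
- pkt_type: the packet class. On receive it is set to one of:
PACKET_HOST: the packet is addressed to this host.
PACKET_OTHERHOST: the packet is addressed to some other host.
PACKET_BROADCAST: a broadcast packet.
PACKET_MULTICAST: a multicast packet.
An Ethernet driver does not have to set pkt_type explicitly, because eth_type_trans() does that for it.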
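The classification can be pictured as a comparison of the frame's destination MAC address against the device address. The sketch below is a simplified userspace approximation of that kind of test, not the code of eth_type_trans(); the function name classify_dest is hypothetical, and the PKT_* constants are local copies of the values found in <linux/if_packet.h> so that the example builds on its own.

```c
#include <string.h>

/* Local copies of the PACKET_* values, renamed to avoid header clashes. */
enum {
    PKT_HOST      = 0,  /* addressed to us */
    PKT_BROADCAST = 1,  /* link-layer broadcast */
    PKT_MULTICAST = 2,  /* link-layer multicast */
    PKT_OTHERHOST = 3   /* addressed to someone else */
};

/* Classify a 6-byte destination MAC relative to our own device address. */
static int classify_dest(const unsigned char dest[6],
                         const unsigned char dev_addr[6])
{
    static const unsigned char bcast[6] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    /* The lowest bit of the first byte marks a group address. */
    if (dest[0] & 0x01)
        return memcmp(dest, bcast, 6) == 0 ? PKT_BROADCAST : PKT_MULTICAST;

    /* Unicast: either ours or somebody else's. */
    return memcmp(dest, dev_addr, 6) == 0 ? PKT_HOST : PKT_OTHERHOST;
}
```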
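2. Socket buffer operations
2.1 Allocating a socket buffer
struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
alloc_skb() allocates a socket buffer (the sk_buff structure) and a data buffer.
- size: the size of the data buffer
- priority: the memory allocation priority (a GFP mask such as GFP_KERNEL or GFP_ATOMIC)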
    static inline struct sk_buff *alloc_skb(unsigned int size,
                                            gfp_t priority)
    {
        return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
    }

    /**
     * __alloc_skb - allocate a network buffer
     * @size: size to allocate
     * @gfp_mask: allocation mask
     * @fclone: allocate from fclone cache instead of head cache
     *     and allocate a cloned (child) skb
     * @node: numa node to allocate memory on
     *
     * Allocate a new &sk_buff. The returned buffer has no headroom and a
     * tail room of size bytes. The object has a reference count of one.
     * The return is the buffer. On a failure the return is %NULL.
     *
     * Buffers may only be allocated from interrupts using a @gfp_mask of
     * %GFP_ATOMIC.
     */
    struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
                                int fclone, int node)
    {
        struct kmem_cache *cache;
        struct skb_shared_info *shinfo;
        struct sk_buff *skb;
        u8 *data;

        cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

        /* Get the HEAD */
        skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
        if (!skb)
            goto out;
        prefetchw(skb);

        /* We do our best to align skb_shared_info on a separate cache
         * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
         * aligned memory blocks, unless SLUB/SLAB debug is enabled.
         * Both skb->head and skb_shared_info are cache line aligned.
         */
        size = SKB_DATA_ALIGN(size);
        size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
        data = kmalloc_node_track_caller(size, gfp_mask, node);
        if (unlikely(ZERO_OR_NULL_PTR(data)))
            goto nodata;
        /* kmalloc(size) might give us more room than requested.
         * Put skb_shared_info exactly at the end of allocated zone,
         * to allow max possible filling before reallocation.
         */
        size = SKB_WITH_OVERHEAD(ksize(data));
        prefetchw(data + size);

        /*
         * Only clear those fields we need to clear, not those that we will
         * actually initialise below. Hence, don't put any more fields after
         * the tail pointer in struct sk_buff!
         */
        memset(skb, 0, offsetof(struct sk_buff, tail));
        /* Account for allocated memory : skb + skb->head */
        skb->truesize = SKB_TRUESIZE(size);
        atomic_set(&skb->users, 1);
        skb->head = data;
        skb->data = data;
        skb_reset_tail_pointer(skb);
        skb->end = skb->tail + size;
    #ifdef NET_SKBUFF_DATA_USES_OFFSET
        skb->mac_header = ~0U;
    #endif

        /* make sure we initialize shinfo sequentially */
        shinfo = skb_shinfo(skb);
        memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
        atomic_set(&shinfo->dataref, 1);
        kmemcheck_annotate_variable(shinfo->destructor_arg);

        if (fclone) {
            struct sk_buff *child = skb + 1;
            atomic_t *fclone_ref = (atomic_t *) (child + 1);

            kmemcheck_annotate_bitfield(child, flags1);
            kmemcheck_annotate_bitfield(child, flags2);
            skb->fclone = SKB_FCLONE_ORIG;
            atomic_set(fclone_ref, 1);

            child->fclone = SKB_FCLONE_UNAVAILABLE;
        }
    out:
        return skb;
    nodata:
        kmem_cache_free(cache, skb);
        skb = NULL;
        goto out;
    }
    EXPORT_SYMBOL(__alloc_skb);
struct sk_buff *dev_alloc_skb(unsigned int length);
dev_alloc_skb() calls the alloc_skb() function above with GFP_ATOMIC priority.
It also reserves NET_SKB_PAD bytes of headroom between skb->head and skb->data.
    /**
     * dev_alloc_skb - allocate an skbuff for receiving
     * @length: length to allocate
     *
     * Allocate a new &sk_buff and assign it a usage count of one. The
     * buffer has unspecified headroom built in. Users should allocate
     * the headroom they think they need without accounting for the
     * built in space. The built in space is used for optimisations.
     *
     * %NULL is returned if there is no free memory. Although this function
     * allocates memory it can be called from an interrupt.
     */
    struct sk_buff *dev_alloc_skb(unsigned int length)
    {
        /*
         * There is more code here than it seems:
         * __dev_alloc_skb is an inline
         */
        return __dev_alloc_skb(length, GFP_ATOMIC);
    }
    EXPORT_SYMBOL(dev_alloc_skb);

    /**
     * __dev_alloc_skb - allocate an skbuff for receiving
     * @length: length to allocate
     * @gfp_mask: get_free_pages mask, passed to alloc_skb
     *
     * Allocate a new &sk_buff and assign it a usage count of one. The
     * buffer has unspecified headroom built in. Users should allocate
     * the headroom they think they need without accounting for the
     * built in space. The built in space is used for optimisations.
     *
     * %NULL is returned if there is no free memory.
     */
    static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
                                                  gfp_t gfp_mask)
    {
        struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
        if (likely(skb))
            skb_reserve(skb, NET_SKB_PAD);
        return skb;
    }
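The over-allocate-then-reserve pattern of __dev_alloc_skb() can be imitated in userspace. The following is only a sketch under stated assumptions: mini_skb, mini_alloc, mini_dev_alloc, and mini_free are invented names, malloc() stands in for the kernel allocators, and MINI_PAD is an arbitrary stand-in for NET_SKB_PAD rather than its real value.

```c
#include <stdlib.h>

#define MINI_PAD 32  /* hypothetical stand-in for NET_SKB_PAD */

/* Hypothetical userspace model of a socket buffer. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int   len;
};

/* Allocate the descriptor and a data buffer of `size` bytes; no headroom. */
static struct mini_skb *mini_alloc(unsigned int size)
{
    struct mini_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;  /* empty buffer */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

/* Ask for `length` usable bytes plus the pad, then reserve the pad. */
static struct mini_skb *mini_dev_alloc(unsigned int length)
{
    struct mini_skb *skb = mini_alloc(length + MINI_PAD);

    if (skb) {
        skb->data += MINI_PAD;  /* same effect as skb_reserve() */
        skb->tail += MINI_PAD;
    }
    return skb;
}

/* Release the data buffer and the descriptor. */
static void mini_free(struct mini_skb *skb)
{
    if (skb) {
        free(skb->head);
        free(skb);
    }
}
```

After mini_dev_alloc(n) the caller still has n bytes of tailroom, and the pad remains available in front of data for headers added later.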
2.2 Freeing a socket buffer
void kfree_skb(struct sk_buff *skb);
    /**
     * kfree_skb - free an sk_buff
     * @skb: buffer to free
     *
     * Drop a reference to the buffer and free it if the usage count has
     * hit zero.
     */
    void kfree_skb(struct sk_buff *skb)
    {
        if (unlikely(!skb))
            return;
        if (likely(atomic_read(&skb->users) == 1))
            smp_rmb();
        else if (likely(!atomic_dec_and_test(&skb->users)))
            return;
        trace_kfree_skb(skb, __builtin_return_address(0));
        __kfree_skb(skb);
    }
    EXPORT_SYMBOL(kfree_skb);
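- kfree_skb() is meant for use inside the kernel's networking core. Network device drivers should use dev_kfree_skb(), dev_kfree_skb_irq(), or dev_kfree_skb_any() instead.
void dev_kfree_skb(struct sk_buff *skb);
- dev_kfree_skb() is for non-interrupt context. In the version shown here it is a macro for consume_skb(), which works like kfree_skb() but is traced as normal consumption rather than as a drop after a failure.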
    #define dev_kfree_skb(a) consume_skb(a)

    /**
     * consume_skb - free an skbuff
     * @skb: buffer to free
     *
     * Drop a ref to the buffer and free it if the usage count has hit zero
     * Functions identically to kfree_skb, but kfree_skb assumes that the frame
     * is being dropped after a failure and notes that
     */
    void consume_skb(struct sk_buff *skb)
    {
        if (unlikely(!skb))
            return;
        if (likely(atomic_read(&skb->users) == 1))
            smp_rmb();
        else if (likely(!atomic_dec_and_test(&skb->users)))
            return;
        trace_consume_skb(skb);
        __kfree_skb(skb);
    }
    EXPORT_SYMBOL(consume_skb);
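void dev_kfree_skb_irq(struct sk_buff *skb);
- dev_kfree_skb_irq() is for interrupt context. It does not free the buffer on the spot: it puts it on the per-CPU completion_queue and raises NET_TX_SOFTIRQ, so the actual free happens later in softirq context.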
    void dev_kfree_skb_irq(struct sk_buff *skb)
    {
        if (atomic_dec_and_test(&skb->users)) {
            struct softnet_data *sd;
            unsigned long flags;

            local_irq_save(flags);
            sd = &__get_cpu_var(softnet_data);
            skb->next = sd->completion_queue;
            sd->completion_queue = skb;
            raise_softirq_irqoff(NET_TX_SOFTIRQ);
            local_irq_restore(flags);
        }
    }
    EXPORT_SYMBOL(dev_kfree_skb_irq);
void dev_kfree_skb_any(struct sk_buff *skb);
- dev_kfree_skb_any() can be used in both interrupt and non-interrupt context.
    void dev_kfree_skb_any(struct sk_buff *skb)
    {
        if (in_irq() || irqs_disabled())
            dev_kfree_skb_irq(skb);
        else
            dev_kfree_skb(skb);
    }
    EXPORT_SYMBOL(dev_kfree_skb_any);
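All of these free functions share one idea: the buffer carries a user count, and memory is released only when the last reference is dropped. The sketch below models that idea in userspace with C11 atomics. It is an illustration under assumptions, not the kernel's code: mini_skb, mini_new, mini_hold, mini_release, and the freed_count counter are all invented for the example.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace model: a buffer with a reference count. */
struct mini_skb {
    atomic_int     users;  /* plays the role of skb->users */
    unsigned char *head;   /* data buffer */
};

static int freed_count;    /* counts real frees, for demonstration only */

/* Allocate a buffer that starts with one reference. */
static struct mini_skb *mini_new(size_t size)
{
    struct mini_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    atomic_init(&skb->users, 1);
    return skb;
}

/* Take an additional reference. */
static struct mini_skb *mini_hold(struct mini_skb *skb)
{
    atomic_fetch_add(&skb->users, 1);
    return skb;
}

/* Drop one reference; free the memory only if it was the last one. */
static void mini_release(struct mini_skb *skb)
{
    if (!skb)
        return;
    /* atomic_fetch_sub returns the value before the decrement. */
    if (atomic_fetch_sub(&skb->users, 1) != 1)
        return;
    free(skb->head);
    free(skb);
    freed_count++;
}
```

Unlike kfree_skb(), this model has no fast path for the single-owner case and no tracing; it only shows the drop-and-free-at-zero rule.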
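2.3 Moving the pointers
The pointer operations on a Linux socket buffer are put, push, pull, and reserve.
2.3.1 The put operation
unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
skb_put() moves the tail pointer down (toward end) by len bytes, increases skb->len by len, and returns the old value of skb->tail, that is, a pointer to the first byte of the newly added area.
It is used to append data at the tail of the buffer.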
    /**
     * skb_put - add data to a buffer
     * @skb: buffer to use
     * @len: amount of data to add
     *
     * This function extends the used data area of the buffer. If this would
     * exceed the total buffer size the kernel will panic. A pointer to the
     * first byte of the extra data is returned.
     */
    unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
    {
        unsigned char *tmp = skb_tail_pointer(skb);
        SKB_LINEAR_ASSERT(skb);
        skb->tail += len;
        skb->len += len;
        if (unlikely(skb->tail > skb->end))
            skb_over_panic(skb, len, __builtin_return_address(0));
        return tmp;
    }
    EXPORT_SYMBOL(skb_put);

    static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
    {
        return skb->tail;
    }
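unsigned char *__skb_put(struct sk_buff *skb, unsigned int len);
__skb_put() differs from skb_put() in that skb_put() checks that the new tail does not run past the end of the buffer (and panics if it does), while __skb_put() performs no such check.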
    static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
    {
        unsigned char *tmp = skb_tail_pointer(skb);
        SKB_LINEAR_ASSERT(skb);
        skb->tail += len;
        skb->len += len;
        return tmp;
    }
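2.3.2 The push operation:
unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
skb_push() moves the data pointer up (toward head) by len bytes, which adds space at the start of the data area, and therefore also increases skb->len by len.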
    /**
     * skb_push - add data to the start of a buffer
     * @skb: buffer to use
     * @len: amount of data to add
     *
     * This function extends the used data area of the buffer at the buffer
     * start. If this would exceed the total buffer headroom the kernel will
     * panic. A pointer to the first byte of the extra data is returned.
     */
    unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
    {
        skb->data -= len;
        skb->len += len;
        if (unlikely(skb->data < skb->head))
            skb_under_panic(skb, len, __builtin_return_address(0));
        return skb->data;
    }
    EXPORT_SYMBOL(skb_push);
unsigned char *__skb_push(struct sk_buff *skb, unsigned int len);
    static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
    {
        skb->data -= len;
        skb->len += len;
        return skb->data;
    }
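The difference between __skb_push() and skb_push() is the same as the difference between __skb_put() and skb_put(): only the version without the underscores checks the bounds.
The push operation adds space for packet data at the head of the data area, whereas the put operation adds space at its tail.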
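The contrast between put and push can be tried out in a small userspace model. This is a sketch under assumptions, not kernel code: mini_skb, mini_init, mini_put, and mini_push are made-up names, and assert() stands in for skb_over_panic() and skb_under_panic(). Unlike the kernel functions, the model checks the bounds before moving a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a socket buffer. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int   len;
};

/* Wrap a caller-supplied buffer, leaving `headroom` bytes free in front. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
                      unsigned int size, unsigned int headroom)
{
    skb->head = buf;
    skb->data = skb->tail = buf + headroom;
    skb->end = buf + size;
    skb->len = 0;
}

/* Grow the data area at the tail; return the start of the new space. */
static unsigned char *mini_put(struct mini_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert(len <= (size_t)(skb->end - skb->tail));  /* would overrun end */
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Grow the data area at the front; return the new start of the data. */
static unsigned char *mini_push(struct mini_skb *skb, unsigned int len)
{
    assert(len <= (size_t)(skb->data - skb->head)); /* would underrun head */
    skb->data -= len;
    skb->len += len;
    return skb->data;
}
```

A typical transmit path would put the payload first and then push each protocol header in front of it, which is why headroom has to be reserved before any data is added.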
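2.3.3 The pull operation:
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
skb_pull() moves the data pointer down by len bytes and decreases skb->len by len; it is the inverse of skb_push(). If len is larger than skb->len it returns NULL and leaves the buffer unchanged.
It is mainly used when a lower-layer protocol hands a packet to the layer above: after the pull, data points at the upper layer's header.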
    /**
     * skb_pull - remove data from the start of a buffer
     * @skb: buffer to use
     * @len: amount of data to remove
     *
     * This function removes data from the start of a buffer, returning
     * the memory to the headroom. A pointer to the next data in the buffer
     * is returned. Once the data has been pulled future pushes will overwrite
     * the old data.
     */
    unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
    {
        return skb_pull_inline(skb, len);
    }
    EXPORT_SYMBOL(skb_pull);

    static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
    {
        return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
    }

    static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
    {
        skb->len -= len;
        BUG_ON(skb->len < skb->data_len);
        return skb->data += len;
    }
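2.3.4 The reserve operation
void skb_reserve(struct sk_buff *skb, int len);
skb_reserve() moves the data and tail pointers down together by len bytes.
It is used to reserve len bytes of headroom at the front of the buffer, and is only allowed while the buffer is still empty.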
    /**
     * skb_reserve - adjust headroom
     * @skb: buffer to alter
     * @len: bytes to move
     *
     * Increase the headroom of an empty &sk_buff by reducing the tail
     * room. This is only allowed for an empty buffer.
     */
    static inline void skb_reserve(struct sk_buff *skb, int len)
    {
        skb->data += len;
        skb->tail += len;
    }
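Pull and reserve both move data downward, but only pull changes len. The userspace sketch below makes that difference visible. As before it is an assumption-laden model rather than kernel code: mini_skb, mini_reserve, and mini_pull are invented names, and fragments (data_len) are ignored.

```c
#include <stddef.h>

/* Hypothetical userspace model of a socket buffer. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int   len;
};

/* Reserve headroom in an empty buffer: data and tail move together,
 * so the amount of valid data (len) stays at zero. */
static void mini_reserve(struct mini_skb *skb, unsigned int len)
{
    skb->data += len;
    skb->tail += len;
}

/* Drop `len` bytes from the front of the data.
 * Returns the new data pointer, or NULL if fewer than `len` bytes exist. */
static unsigned char *mini_pull(struct mini_skb *skb, unsigned int len)
{
    if (len > skb->len)
        return NULL;
    skb->len -= len;
    skb->data += len;
    return skb->data;
}
```

The pulled bytes are not erased; they simply become headroom again, so a later push can overwrite them.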
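3. Example:
This section walks through how Linux receives a UDP packet, to show how the sk_buff is manipulated along the way.
Most of this work is done by the kernel; a driver only has to handle the link-layer part.
Suppose the network card receives a UDP packet. Linux processes it as follows:
3.1 After the card receives the UDP packet, the driver creates an sk_buff structure and a data buffer, copies all of the received data into the space that data points to, and sets skb->mac_header to data.
At this point the valid data starts with the Ethernet header, i.e. the link-layer header.
Sample code:
// Allocate a new socket buffer and data buffer
3.2 The data link layer calls skb_pull() to strip the Ethernet header and passes the packet up to the network layer (IP).
During the pull, the data pointer moves down by the length of one Ethernet header, sizeof(struct ethhdr), and len is reduced by sizeof(struct ethhdr).
The valid data now starts with the IP header. skb->network_header points to data, i.e. the IP header, while skb->mac_header still points to the Ethernet header, i.e. the link-layer header.
This is shown in the figure below:
3.3 The network layer calls skb_pull() to strip the IP header and passes the packet up to the transport layer (UDP).
During the pull, the data pointer moves down by the length of the IP header, which is sizeof(struct iphdr) when the header carries no options, and len is reduced by the same amount.
The valid data now starts with the UDP header, and skb->transport_header points to data, i.e. the UDP header.
skb->network_header still points to the IP header, and skb->mac_header still points to the link-layer header.
This is shown in the figure below:
3.4 When the application calls recv() to receive the data, the payload is copied into the application's buffer starting at skb->data + sizeof(struct udphdr).
As this shows, the UDP header is never actually stripped from the buffer.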
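The whole receive sequence from 3.1 to 3.4 can be replayed in userspace. The sketch below is a simplified model, not the kernel's receive path: mini_skb, rx_frame, pull, and udp_payload are hypothetical names, the headers are not parsed, and fixed header sizes are assumed (14-byte Ethernet, 20-byte IP without options, 8-byte UDP).

```c
#include <stddef.h>
#include <string.h>

enum { MINI_ETH_HLEN = 14, MINI_IP_HLEN = 20, MINI_UDP_HLEN = 8 };

/* Hypothetical userspace model of a socket buffer with header pointers. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned char *mac_header, *network_header, *transport_header;
    unsigned int   len;
};

/* Step 3.1: the "driver" copies the frame in and marks the MAC header.
 * The caller guarantees frame_len <= size. */
static void rx_frame(struct mini_skb *skb, unsigned char *buf, size_t size,
                     const unsigned char *frame, unsigned int frame_len)
{
    skb->head = skb->data = buf;
    skb->end = buf + size;
    memcpy(skb->data, frame, frame_len);
    skb->tail = skb->data + frame_len;
    skb->len = frame_len;
    skb->mac_header = skb->data;
    skb->network_header = skb->transport_header = NULL;
}

/* Strip n bytes from the front of the valid data. */
static void pull(struct mini_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->len -= n;
}

/* Steps 3.2 to 3.4: strip Ethernet and IP headers, then locate the payload
 * behind the UDP header. Returns NULL if the frame is too short. */
static const unsigned char *udp_payload(struct mini_skb *skb,
                                        unsigned int *payload_len)
{
    if (skb->len < MINI_ETH_HLEN + MINI_IP_HLEN + MINI_UDP_HLEN)
        return NULL;
    pull(skb, MINI_ETH_HLEN);            /* 3.2: link layer done */
    skb->network_header = skb->data;
    pull(skb, MINI_IP_HLEN);             /* 3.3: network layer done */
    skb->transport_header = skb->data;
    /* 3.4: the UDP header is not pulled; the payload is read past it. */
    *payload_len = skb->len - MINI_UDP_HLEN;
    return skb->data + MINI_UDP_HLEN;
}
```

After udp_payload() returns, data still points at the UDP header, and all three header pointers remain valid because the bytes in front of data were never overwritten.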