Linux Network Device Driver Architecture: Receiving and Transmitting Packets

(Source: Linux Network Device Driver Architecture)

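From top to bottom, the architecture can be divided into four layers:
- Network protocol interface layer: gives the network layer a uniform interface for sending and receiving packets.
- Device layer: gives the protocol interface layer a uniform structure, net_device, that describes the attributes and operations of a specific network device.
- Driver layer: its functions are the concrete implementations behind the function-pointer members of net_device; they are the code that makes the hardware carry out the corresponding actions.
- Physical layer (also called the media layer): the physical entity that actually sends and receives packets.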

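When designing a driver for a specific network device, the main work is to write the driver-layer functions, fill in the net_device structure, and register the net_device with the kernel.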

1. Protocol interface layer

1.1 dev_queue_xmit()
When an upper-layer protocol such as ARP or IP needs to send a packet, it calls dev_queue_xmit() and passes it a pointer to a struct sk_buff, which represents the packet to be sent. dev_queue_xmit() is defined as follows; note that it sends a packet out to the network.

/**
 *	dev_queue_xmit - transmit a buffer
 *	@skb: buffer to transmit
 *
 *	Queue a buffer for transmission to a network device. The caller must
 *	have set the device and priority and built the buffer before calling
 *	this function. The function can be called from an interrupt.
 *
 *	A negative errno code is returned on a failure. A success does not
 *	guarantee the frame will be transmitted as it may be dropped due
 *	to congestion or traffic shaping.
 *
 * -----------------------------------------------------------------------------------
 *      I notice this method can also return errors from the queue disciplines,
 *      including NET_XMIT_DROP, which is a positive value.  So, errors can also
 *      be positive.
 *
 *      Regardless of the return value, the skb is consumed, so it is currently
 *      difficult to retry a send to this method.  (You can bump the ref count
 *      before sending to hold a reference for retry if you are careful.)
 *
 *      When calling this method, interrupts MUST be enabled.  This is because
 *      the BH enable code must have IRQs enabled so that it will not deadlock.
 *          --BLG
 */
int dev_queue_xmit(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct netdev_queue *txq;
	struct Qdisc *q;
	int rc = -ENOMEM;

	/* Disable soft irqs for various locks below. Also
	 * stops preemption for RCU.
	 */
	rcu_read_lock_bh();

	skb_update_prio(skb);

	txq = dev_pick_tx(dev, skb);
	q = rcu_dereference_bh(txq->qdisc);

#ifdef CONFIG_NET_CLS_ACT
	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif
	trace_net_dev_queue(skb);
	if (q->enqueue) {
		rc = __dev_xmit_skb(skb, q, dev, txq);
		goto out;
	}

	/* The device has no queue. Common case for software devices:
	   loopback, all the sorts of tunnels...

	   Really, it is unlikely that netif_tx_lock protection is necessary
	   here.  (f.e. loopback and IP tunnels are clean ignoring statistics
	   counters.)
	   However, it is possible, that they rely on protection
	   made by us here.

	   Check this and shot the lock. It is not prone from deadlocks.
	   Either shot noqueue qdisc, it is even simpler 8)
	 */
	if (dev->flags & IFF_UP) {
		int cpu = smp_processor_id(); /* ok because BHs are off */

		if (txq->xmit_lock_owner != cpu) {

			if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT)
				goto recursion_alert;

			HARD_TX_LOCK(dev, txq, cpu);

			if (!netif_xmit_stopped(txq)) {
				__this_cpu_inc(xmit_recursion);
				rc = dev_hard_start_xmit(skb, dev, txq);
				__this_cpu_dec(xmit_recursion);
				if (dev_xmit_complete(rc)) {
					HARD_TX_UNLOCK(dev, txq);
					goto out;
				}
			}
			HARD_TX_UNLOCK(dev, txq);
			if (net_ratelimit())
				pr_crit("Virtual device %s asks to queue packet!\n",
					dev->name);
		} else {
			/* Recursion is detected! It is possible,
			 * unfortunately
			 */
recursion_alert:
			if (net_ratelimit())
				pr_crit("Dead loop on virtual device %s, fix it urgently!\n",
					dev->name);
		}
	}

	rc = -ENETDOWN;
	rcu_read_unlock_bh();

	kfree_skb(skb);
	return rc;
out:
	rcu_read_unlock_bh();
	return rc;
}
EXPORT_SYMBOL(dev_queue_xmit);
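
The comment block above points out that dev_queue_xmit() can report failure either as a negative errno or as a positive queue-discipline code such as NET_XMIT_DROP. A minimal user-space sketch of how a caller might classify the result is shown below. The NET_XMIT_* values are local copies that are assumed to match include/linux/netdevice.h for the kernel version quoted here, and xmit_rc_is_failure() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Local copies of the qdisc return codes. The values are an assumption
 * about the kernel headers, redefined here so the sketch builds in
 * user space. */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01	/* the qdisc dropped the skb */
#define NET_XMIT_CN      0x02	/* congestion notification */

/* Hypothetical helper: returns true when rc says the packet was not
 * queued. Both negative errno values and the positive NET_XMIT_DROP
 * code count as failures; NET_XMIT_CN is treated as a warning only. */
bool xmit_rc_is_failure(int rc)
{
	if (rc < 0)
		return true;		/* e.g. -ENETDOWN */
	return rc == NET_XMIT_DROP;
}
```

Because the skb is consumed whatever the return value, a caller using a check like this can only count or log the failure; it cannot resend the same buffer unless it took an extra reference beforehand.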

1.2 netif_rx()
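Likewise, a received packet is handed up by passing a struct sk_buff pointer to netif_rx().
netif_rx() is defined as follows; note that it takes a packet received from the network and queues it in memory for the upper layers.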

/**
 *	netif_rx	-	post buffer to the network code
 *	@skb: buffer to post
 *
 *	This function receives a packet from a device driver and queues it for
 *	the upper (protocol) levels to process.  It always succeeds. The buffer
 *	may be dropped during processing for congestion control or by the
 *	protocol layers.
 *
 *	return values:
 *	NET_RX_SUCCESS	(no congestion)
 *	NET_RX_DROP     (packet was dropped)
 *
 */

int netif_rx(struct sk_buff *skb)
{
	int ret;

	/* if netpoll wants it, pretend we never saw it */
	if (netpoll_rx(skb))
		return NET_RX_DROP;

	net_timestamp_check(netdev_tstamp_prequeue, skb);

	trace_netif_rx(skb);
#ifdef CONFIG_RPS
	if (static_key_false(&rps_needed)) {
		struct rps_dev_flow voidflow, *rflow = &voidflow;
		int cpu;

		preempt_disable();
		rcu_read_lock();

		cpu = get_rps_cpu(skb->dev, skb, &rflow);
		if (cpu < 0)
			cpu = smp_processor_id();

		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);

		rcu_read_unlock();
		preempt_enable();
	} else
#endif
	{
		unsigned int qtail;
		ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
		put_cpu();
	}
	return ret;
}
EXPORT_SYMBOL(netif_rx);
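
In the listing above, netif_rx() hands the packet to enqueue_to_backlog() and returns either NET_RX_SUCCESS or NET_RX_DROP. The sketch below is a user-space toy of that idea only: a bounded FIFO that drops when full. struct backlog, BACKLOG_MAX, backlog_enqueue() and backlog_dequeue() are hypothetical names for illustration; the real per-CPU queue, its limit, and its locking are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define NET_RX_SUCCESS 0	/* packet queued */
#define NET_RX_DROP    1	/* packet dropped */
#define BACKLOG_MAX    4	/* hypothetical queue limit */

struct pkt {
	int id;
};

/* Toy stand-in for a receive backlog: a fixed-size ring buffer. */
struct backlog {
	struct pkt *slots[BACKLOG_MAX];
	size_t head;		/* index of the oldest queued packet */
	size_t len;		/* number of queued packets */
	unsigned long dropped;	/* packets refused because the queue was full */
};

/* Queue a packet for later protocol processing, or drop it if full. */
int backlog_enqueue(struct backlog *q, struct pkt *p)
{
	if (q->len == BACKLOG_MAX) {
		q->dropped++;
		return NET_RX_DROP;
	}
	q->slots[(q->head + q->len) % BACKLOG_MAX] = p;
	q->len++;
	return NET_RX_SUCCESS;
}

/* Take the oldest packet off the queue; returns NULL when empty. */
struct pkt *backlog_dequeue(struct backlog *q)
{
	struct pkt *p;

	if (q->len == 0)
		return NULL;
	p = q->slots[q->head];
	q->head = (q->head + 1) % BACKLOG_MAX;
	q->len--;
	return p;
}
```

The point of the split is that the driver side only enqueues and returns quickly, while the protocol side drains the queue later at its own pace.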

1.3 sk_buff
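sk_buff plays a central role here.
sk_buff, the socket buffer, carries data between the layers of the Linux networking subsystem and acts as its "nerve center".
When sending a packet, the kernel's networking code must build an sk_buff containing the data to be transmitted and pass it to the next layer down; each layer adds its own protocol header to the sk_buff until it is handed to the network device for transmission. Likewise, when receiving, the network device takes the data from the physical medium, must convert it into an sk_buff, and passes it upward; each layer strips its own protocol header until the payload reaches the user.
The sk_buff structure is defined in include/linux/skbuff.h.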
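
The header handling described above (each layer prepends a header on the way down and strips it on the way up) can be illustrated with a toy buffer. This is a user-space sketch, not the real sk_buff: struct mini_skb and the mini_* functions are hypothetical stand-ins that mimic the roles of the kernel's skb_reserve(), skb_put(), skb_push() and skb_pull().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 128

/* Toy stand-in for sk_buff: the live packet is buf[data .. tail). */
struct mini_skb {
	unsigned char buf[BUF_SIZE];
	size_t data;	/* offset of the first valid byte */
	size_t tail;	/* offset one past the last valid byte */
};

/* Leave headroom so lower layers can prepend headers without copying. */
void mini_reserve(struct mini_skb *skb, size_t len)
{
	assert(skb->data == skb->tail);		/* only valid while empty */
	assert(skb->tail + len <= BUF_SIZE);
	skb->data += len;
	skb->tail += len;
}

/* Append len bytes of payload; returns where to write them. */
unsigned char *mini_put(struct mini_skb *skb, size_t len)
{
	unsigned char *p = skb->buf + skb->tail;

	assert(skb->tail + len <= BUF_SIZE);
	skb->tail += len;
	return p;
}

/* Prepend len bytes (a layer adding its header on the send path). */
unsigned char *mini_push(struct mini_skb *skb, size_t len)
{
	assert(len <= skb->data);		/* needs enough headroom */
	skb->data -= len;
	return skb->buf + skb->data;
}

/* Strip len bytes from the front (a layer removing its header on receive). */
unsigned char *mini_pull(struct mini_skb *skb, size_t len)
{
	assert(len <= skb->tail - skb->data);
	skb->data += len;
	return skb->buf + skb->data;
}

/* Number of valid bytes currently in the packet. */
size_t mini_len(const struct mini_skb *skb)
{
	return skb->tail - skb->data;
}
```

A sender would reserve room for all expected headers, put the payload, and then let each layer push its header; a receiver pulls the headers off in the opposite order. Only the offsets move, so the payload is never copied between layers.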

2. Device layer

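The device layer's main job is to define one uniform data structure, net_device, for all kinds of network devices, so that different hardware looks the same at the software level.
A net_device structure represents one network device inside the kernel.
A network device driver only needs to fill in the relevant members of net_device and register it in order to hook its hardware operation functions into the kernel.
net_device is a very large structure that contains both the attributes of the network device and its interface operations.
More details: net_device & net_device_ops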

3. Driver layer

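The members of net_device (attributes and function pointers) must be assigned concrete values and functions by the driver layer.
For a specific network device ethx, the engineer writes the driver-layer functions, such as:
ethx_open();
ethx_stop();
ethx_tx();
ethx_hard_header();
ethx_get_stats();
ethx_tx_timeout();
ethx_poll();
and so on.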
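
To show how callbacks like those listed above get attached to a device and then called by the layer above, here is a user-space toy. It is a sketch under stated assumptions: struct toy_netdev, struct toy_netdev_ops, toy_register_netdev() and toy_xmit() are hypothetical and only loosely modelled on the roles of net_device, net_device_ops and device registration; none of them is kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_netdev;

/* Table of driver callbacks; the upper layer only ever sees this. */
struct toy_netdev_ops {
	int (*open)(struct toy_netdev *dev);
	int (*stop)(struct toy_netdev *dev);
	int (*start_xmit)(const void *frame, size_t len, struct toy_netdev *dev);
};

/* Toy device descriptor: attributes plus a pointer to the callbacks. */
struct toy_netdev {
	char name[16];
	unsigned int mtu;
	int up;
	unsigned long tx_packets;
	const struct toy_netdev_ops *ops;
	struct toy_netdev *next;	/* link in the registered-device list */
};

static struct toy_netdev *dev_list;

/* Register a device; fails if it has no transmit hook or the name is taken. */
int toy_register_netdev(struct toy_netdev *dev)
{
	struct toy_netdev *d;

	if (!dev->ops || !dev->ops->start_xmit)
		return -1;
	for (d = dev_list; d; d = d->next)
		if (strcmp(d->name, dev->name) == 0)
			return -1;
	dev->next = dev_list;
	dev_list = dev;
	return 0;
}

/* Upper-layer entry point: dispatches through the ops table. */
int toy_xmit(struct toy_netdev *dev, const void *frame, size_t len)
{
	if (!dev->up)
		return -1;
	return dev->ops->start_xmit(frame, len, dev);
}

/* Driver side: the ethx callbacks (bodies are placeholders). */
static int ethx_open(struct toy_netdev *dev) { dev->up = 1; return 0; }
static int ethx_stop(struct toy_netdev *dev) { dev->up = 0; return 0; }
static int ethx_start_xmit(const void *frame, size_t len, struct toy_netdev *dev)
{
	(void)frame;
	(void)len;
	dev->tx_packets++;	/* a real driver would program the hardware here */
	return 0;
}

static const struct toy_netdev_ops ethx_ops = {
	.open = ethx_open,
	.stop = ethx_stop,
	.start_xmit = ethx_start_xmit,
};

struct toy_netdev ethx_dev = { .name = "ethx", .mtu = 1500, .ops = &ethx_ops };
```

The upper layer never calls an ethx_* function by name. It only follows the pointers in the ops table, which is what lets one protocol stack drive many different kinds of hardware.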

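Packet reception can be triggered by an interrupt, so the other major part of the driver layer is the interrupt handler.
It reads packets from the hardware and passes them to the upper-layer protocols.
The interrupt handling functions typically look like this:
ethx_interrupt(): does the basic work, such as determining the interrupt type.
ethx_rx(): does the heavier work of building the packet and passing it to the upper layer.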
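
The division of labour between the two functions can be sketched in user space with a simulated status register. Everything hardware-related here is an assumption made for illustration: struct ethx_sim, the ETHX_ISR_* bits and the signatures are hypothetical, and a counter stands in for building an sk_buff and calling netif_rx().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical interrupt status bits; real hardware defines its own. */
#define ETHX_ISR_RX      0x01u	/* at least one frame received */
#define ETHX_ISR_TX_DONE 0x02u	/* a transmit has completed */

/* Simulated device state standing in for real registers and FIFOs. */
struct ethx_sim {
	uint32_t isr;			/* interrupt status register */
	unsigned int rx_pending;	/* frames waiting in the hardware FIFO */
	unsigned long rx_packets;	/* frames handed to the upper layer */
	unsigned long tx_done;		/* completed transmits seen */
};

/* Heavy part: drain the FIFO and pass each frame upward. */
void ethx_rx(struct ethx_sim *hw)
{
	while (hw->rx_pending > 0) {
		hw->rx_pending--;
		hw->rx_packets++;	/* stands in for skb allocation + netif_rx() */
	}
}

/* Light part: read the cause, acknowledge it, dispatch.
 * Returns 1 if the interrupt was ours, 0 otherwise. */
int ethx_interrupt(struct ethx_sim *hw)
{
	uint32_t status = hw->isr;

	if (status == 0)
		return 0;		/* not raised by this device */
	hw->isr = 0;			/* acknowledge all pending causes */

	if (status & ETHX_ISR_RX)
		ethx_rx(hw);
	if (status & ETHX_ISR_TX_DONE)
		hw->tx_done++;
	return 1;
}
```

Keeping ethx_interrupt() short and pushing the per-packet work into ethx_rx() mirrors the split described above: the handler decides what happened, and the receive routine does the expensive part.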

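For a particular device we can also define its own private data (private_data) and operations, wrapped in a private structure, struct ethx_private, whose pointer is assigned to the priv member of net_device. (In kernels where net_device no longer has a priv member, the private area is obtained with netdev_priv() instead.)
struct ethx_private can hold device-specific attributes, spinlocks or semaphores, timers, statistics, and so on, as defined by the engineer.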
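
One common way to attach private data is to allocate it in the same block as the device structure and find it again by offset. The user-space sketch below illustrates that layout only; toy_alloc_netdev(), toy_netdev_priv(), TOY_ALIGN and the fields of struct ethx_private are hypothetical choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy device descriptor (generic part, owned by the "core"). */
struct toy_netdev {
	char name[16];
	unsigned int mtu;
};

/* Driver-defined private state; the fields are illustrative. */
struct ethx_private {
	unsigned long rx_packets;
	unsigned long tx_packets;
	int link_up;
};

/* Round the generic part up so the private area starts aligned. */
#define TOY_ALIGN    32
#define TOY_DEV_SIZE \
	((sizeof(struct toy_netdev) + TOY_ALIGN - 1) & ~((size_t)TOY_ALIGN - 1))

/* Allocate the device and its zeroed private area in one block. */
struct toy_netdev *toy_alloc_netdev(size_t sizeof_priv)
{
	return calloc(1, TOY_DEV_SIZE + sizeof_priv);
}

/* Locate the private area that follows the generic part. */
void *toy_netdev_priv(struct toy_netdev *dev)
{
	return (char *)dev + TOY_DEV_SIZE;
}
```

With this layout the generic code never needs to know what struct ethx_private contains, and a single free() releases both parts.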

4. Physical media layer

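The media layer corresponds directly to the actual hardware. To carry out common operations on the device's physical configuration and registers, we can define a set of macros or accessor functions for its internal registers.
The exact definitions and functions are closely tied to the particular hardware. A design example follows: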

// Register definitions
#define DATA_REG 0x0004
#define CMD_REG 0x0008

// Register read/write functions
static u16 xxx_readword(u32 base_addr, int portno)
{
...  // read the register value and return it
}

static void xxx_writeword(u32 base_addr, int portno, u16 value)
{
...   // write the value to the register
}
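
The accessor bodies above are left elided because they depend on the hardware. To make the pattern concrete without guessing at any real device, the sketch below simulates the register window with a plain byte array in user space. sim_readword(), sim_writeword(), fake_regs and REG_WINDOW_SIZE are hypothetical, and the little-endian byte order is an assumption; a real driver would go through the kernel's I/O accessors on a mapped address instead.

```c
#include <assert.h>
#include <stdint.h>

#define DATA_REG        0x0004
#define CMD_REG         0x0008
#define REG_WINDOW_SIZE 0x10	/* hypothetical size of the register window */

/* Simulated register window; stands in for mapped device memory. */
uint8_t fake_regs[REG_WINDOW_SIZE];

/* Read a 16-bit register at the given offset (little-endian assumed). */
uint16_t sim_readword(const uint8_t *base, unsigned int offset)
{
	assert(offset + 1 < REG_WINDOW_SIZE);
	return (uint16_t)(base[offset] | (base[offset + 1] << 8));
}

/* Write a 16-bit value to the register at the given offset. */
void sim_writeword(uint8_t *base, unsigned int offset, uint16_t value)
{
	assert(offset + 1 < REG_WINDOW_SIZE);
	base[offset] = (uint8_t)(value & 0xff);
	base[offset + 1] = (uint8_t)(value >> 8);
}
```

Funnelling every access through two small functions like these keeps the offsets and the byte order in one place, so the rest of the driver can be written in terms of register names such as CMD_REG and DATA_REG.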
