LINUX NETWORK IMPLEMENTATION Jianyong Zhang



  1. LINUX NETWORK IMPLEMENTATION, Jianyong Zhang

  2. Introduction
     • The layer structure of the network:
         • BSD socket layer: general data structure for different protocols.
         • INET socket layer: end points for the IP-based protocols TCP and UDP.
         • ARP layer
         • Link layer: Ethernet, SLIP, PLIP
         • Hardware: NIC, serial port, parallel port

  3. Socket system call
     • C interface system call routines: socket(), bind(), listen(), connect(), accept(), send(), sendto(), recv(), recvfrom(), getsockopt(), setsockopt().
     • All are based on the system call socketcall().
     • socket() returns a file descriptor; read(), write(), select(), ioctl() go through struct file: file → f_op → sock_read
     • Socket inode:
         struct socket *sock_alloc(void)
         { …
             inode->i_mode = S_IFSOCK|S_IRWXUGO;
             inode->i_sock = 1;
             inode->i_uid = current->fsuid;
             inode->i_gid = current->fsgid;
             sock->inode = inode; …
         }

  4. Generic system call
     • socketcall() function:
         asmlinkage int sys_socketcall(int call, unsigned long *args)
         { …
             unsigned long a0, a1;
             /* copy_from_user should be SMP safe. */
             if (copy_from_user(a, args, nargs[call]))
                 return -EFAULT;
             a0 = a[0];
             a1 = a[1];
             switch (call)
             {
             case SYS_SOCKET:
                 err = sys_socket(a0, a1, a[2]);
                 break;
             case SYS_BIND:
                 err = sys_bind(a0, (struct sockaddr *)a1, a[2]);
                 break;
             …
             }
             …
         }
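
     The multiplexing pattern in sys_socketcall() (one entry point, a per-call argument-size table, then a switch) can be modelled in user space. This is a sketch under stated assumptions: the call numbers, the demo_* handlers and their return values are invented for illustration, and memcpy() stands in for copy_from_user().

```c
#include <errno.h>
#include <string.h>

/* Hypothetical call numbers; the kernel's SYS_* constants are not reproduced here. */
enum { DEMO_SOCKET = 1, DEMO_BIND = 2, DEMO_LISTEN = 3 };

/* Bytes of argument data to copy for each call, indexed by call number. */
#define AL(x) ((x) * sizeof(unsigned long))
static const unsigned char demo_nargs[] = { AL(0), AL(3), AL(3), AL(2) };

/* Stand-in handlers that just combine their arguments so the dispatch is observable. */
static long demo_socket(unsigned long family, unsigned long type, unsigned long protocol)
{
    return (long)(family * 100 + type * 10 + protocol);
}

static long demo_bind(unsigned long fd, unsigned long addr, unsigned long addrlen)
{
    (void)addr;
    return (long)(fd + addrlen);
}

static long demo_listen(unsigned long fd, unsigned long backlog)
{
    return (long)(fd * backlog);
}

long demo_socketcall(int call, const unsigned long *args)
{
    unsigned long a[3];

    if (call < DEMO_SOCKET || call > DEMO_LISTEN)
        return -EINVAL;

    /* Copy only as many words as this call needs (models copy_from_user). */
    memcpy(a, args, demo_nargs[call]);

    switch (call) {
    case DEMO_SOCKET:
        return demo_socket(a[0], a[1], a[2]);
    case DEMO_BIND:
        return demo_bind(a[0], a[1], a[2]);
    case DEMO_LISTEN:
        return demo_listen(a[0], a[1]);
    }
    return -EINVAL;
}

/* Example: dispatch a "socket" request with arguments (2, 1, 0). */
long demo_socketcall_example(void)
{
    unsigned long args[3] = { 2, 1, 0 };

    return demo_socketcall(DEMO_SOCKET, args);
}
```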

  5. Important structures
     • 1. struct socket {
             socket_state state;  /* SS_FREE, SS_UNCONNECTED, SS_CONNECTING, SS_CONNECTED, SS_DISCONNECTING */
             unsigned long flags;
             struct proto_ops *ops;
             struct inode *inode;
             struct fasync_struct *fasync_list;  /* Asynchronous wake up list */
             struct file *file;  /* File back pointer */
             struct sock *sk;
             struct wait_queue *wait;
             short type;  /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW */
             unsigned char passcred;
             unsigned char tli;
         };

  6. Important structures
     • 2. struct proto_ops {
             int family;
             int (*dup) (struct socket *newsock, struct socket *oldsock);
             int (*release) (struct socket *sock, struct socket *peer);
             int (*bind) ();
             int (*connect) ();
             int (*socketpair) (struct socket *sock1, struct socket *sock2);
             int (*accept) ();
             int (*getname) ();
             unsigned int (*poll) ();
             int (*ioctl) ();
             int (*listen) (struct socket *sock, int len);
             int (*shutdown) (struct socket *sock, int flags);
             int (*setsockopt) (struct socket *sock, int level, int optname, …);
             int (*getsockopt) ();
             int (*fcntl) ();
             int (*sendmsg) ();
             int (*recvmsg) ();
         };
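
     struct proto_ops is a table of function pointers: the BSD socket layer calls through sock->ops without knowing which protocol family is underneath. A minimal user-space sketch of that pattern follows. All demo_* names are hypothetical, only two of the operations are modelled, and the family value 2 is just a placeholder.

```c
#include <stddef.h>

struct demo_socket;

/* Reduced operations table: only listen and shutdown are modelled. */
struct demo_proto_ops {
    int family;
    int (*listen)(struct demo_socket *sock, int len);
    int (*shutdown)(struct demo_socket *sock, int flags);
};

struct demo_socket {
    const struct demo_proto_ops *ops;  /* family-specific operations */
    int backlog;                       /* set by listen */
    int shut_flags;                    /* set by shutdown */
};

/* One concrete "family" implementation. */
static int demo_inet_listen(struct demo_socket *sock, int len)
{
    sock->backlog = len;
    return 0;
}

static int demo_inet_shutdown(struct demo_socket *sock, int flags)
{
    sock->shut_flags = flags;
    return 0;
}

static const struct demo_proto_ops demo_inet_ops = {
    .family = 2,  /* placeholder family number */
    .listen = demo_inet_listen,
    .shutdown = demo_inet_shutdown,
};

/* Generic layer: dispatches through the table, unaware of the family. */
int demo_generic_listen(struct demo_socket *sock, int len)
{
    if (sock == NULL || sock->ops == NULL || sock->ops->listen == NULL)
        return -1;
    return sock->ops->listen(sock, len);
}

/* Example: returns the backlog stored by the family-specific listen. */
int demo_ops_example(void)
{
    struct demo_socket s = { &demo_inet_ops, 0, 0 };

    if (demo_generic_listen(&s, 5) != 0)
        return -1;
    return s.backlog;
}
```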

  7. Important structures
     • 3. struct sk_buff { … }:
         • manages individual communication packets,
         • kept in a doubly linked list
     • 4. struct sock { … }
         • INET socket
     • 5. struct device { … }
         • controls an abstract network device: the network interface.

  8. Getting the data from A to B
     • 1. A and B call socket(), then are connected by calling connect() and accept().
     • 2. A: write(socket, data, len): verify_area().
         { … file = fget(socket);
             inode = file->f_dentry->d_inode;
             if (!file->f_op || !(write = file->f_op->write)) goto out;
             down(&inode->i_sem);
             ret = write(file, data, len, &file->f_pos);
             up(&inode->i_sem); … }
     • 3. sock_write() { …
             struct socket *sock;
             sock = socki_lookup(file->f_dentry->d_inode); …
             msg.msg_iov = &iov;
             iov.iov_base = (void *)ubuf; …
             return sock_sendmsg(sock, &msg, size); }
     • 4. For an INET socket, it will call inet_sendmsg().

  9. Getting the data from A to B
     • 5. inet_sendmsg() {
             struct sock *sk = sock->sk; …
             return sk->prot->sendmsg(sk, msg, size); }
         /* call tcp_v4_sendmsg() */
     • 6. Call tcp_do_sendmsg(sk, msg) { …
             struct sk_buff *skb;
             tmp = MAX_HEADER + sk->prot->max_header;
             skb = sock_wmalloc(sk, tmp, 0, GFP_KERNEL);
             skb_reserve(skb, MAX_HEADER + sk->prot->max_header);
             skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err);
             /* TCP data bytes are SKB_PUT() on top, later TCP+IP+DEV headers are SKB_PUSH()'d beneath. */
             tcp_send_skb(sk, skb, queue_it); … }
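
     The reserve/put/push sequence in tcp_do_sendmsg() is easier to follow with a small model of the buffer pointers. The sketch below is a user-space illustration only. struct demo_skb and its helpers are hypothetical simplifications of sk_buff, there is no bounds checking, and the header sizes (14-byte Ethernet, 20-byte IP, 20-byte TCP) are assumed typical values rather than figures from the slides.

```c
#include <stddef.h>
#include <string.h>

/* Simplified packet buffer: payload lives in [data, tail), headroom is [buf, data). */
struct demo_skb {
    unsigned char buf[256];
    unsigned char *data;
    unsigned char *tail;
    size_t len;
};

void demo_skb_init(struct demo_skb *skb)
{
    skb->data = skb->tail = skb->buf;
    skb->len = 0;
}

/* Leave n bytes of headroom; only meaningful on an empty buffer. */
void demo_skb_reserve(struct demo_skb *skb, size_t n)
{
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
unsigned char *demo_skb_put(struct demo_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes in the headroom; returns the new start of data. */
unsigned char *demo_skb_push(struct demo_skb *skb, size_t n)
{
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

size_t demo_skb_headroom(const struct demo_skb *skb)
{
    return (size_t)(skb->data - skb->buf);
}

/* Example: payload first, then TCP, IP and Ethernet headers pushed in front.
 * Returns 1 if the final layout is as expected. */
int demo_skb_example(void)
{
    struct demo_skb skb;

    demo_skb_init(&skb);
    demo_skb_reserve(&skb, 54);                 /* assumed 14 + 20 + 20 bytes of headers */
    memcpy(demo_skb_put(&skb, 5), "hello", 5);  /* user data goes in first */
    memset(demo_skb_push(&skb, 20), 0, 20);     /* TCP header */
    memset(demo_skb_push(&skb, 20), 0, 20);     /* IP header */
    memset(demo_skb_push(&skb, 14), 0, 14);     /* Ethernet header */

    return demo_skb_headroom(&skb) == 0 &&
           skb.len == 59 &&
           memcmp(skb.data + 54, "hello", 5) == 0;
}
```

     Reserving the headroom up front means each lower layer can prepend its header without moving the payload.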

  10. Getting the data from A to B
     • 5. tcp_send_skb() calls tcp_transmit_skb(sk, skb_clone(skb, GFP_KERNEL));
     • 6. tcp_transmit_skb(struct sock *sk, struct sk_buff *skb) { …
             struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);
             /* Build TCP header and checksum it. */ …
             tp->af_specific->queue_xmit(skb); … }
     • 7. ip_queue_xmit()
         /* Queues a packet to be sent, and starts the transmitter if necessary. This routine also needs to put in the total length and compute the checksum. */
         { …
             /* Make sure we can route this packet. */
             skb->dst = dst_clone(sk->dst_cache);
             /* OK, we know where to send it, allocate and build IP header. */ …
             /* Do we need to fragment. Again this is inefficient. We need to somehow lock the original buffer and use bits of it. */ …
             /* Add an IP checksum. */ …

  11. Getting the data from A to B
             skb->dst->output(skb); … }
     • 7. Bottom-half (bh) synchronization with barrier: start_bh_atomic(void), end_bh_atomic(void)
     • 8. dev_queue_xmit() { …
             start_bh_atomic();
             q = dev->qdisc;
             if (q->enqueue) {
                 q->enqueue(skb, q);
                 qdisc_wakeup(dev);
                 end_bh_atomic(); …
                 return; }
             if (dev->flags & IFF_UP) {
                 dev->hard_start_xmit(skb, dev);
                 end_bh_atomic();
                 return; }
         }
     • 9. For the WD8013 card, ei_start_xmit() is called; it passes the data to the network adaptor, which in turn sends the packet onto the Ethernet.

  12. Getting the data from A to B
     • 10. The data, embedded in an Ethernet packet, are received by the NIC in B. (The NIC is assumed to be a WD8013.)
     • 11. The NIC triggers an interrupt. This is handled by ei_interrupt(), which calls ei_receive(). (The ei_* functions are chip-specific code for many 8390-based Ethernet adaptors.)
     • 12. ei_receive() { …
             struct sk_buff *skb;
             skb = dev_alloc_skb(pkt_len+2); …
             netif_rx(skb); … }
     • 13. netif_rx() receives a packet from a device driver and queues it for the upper (protocol) levels. It calls { skb_queue_tail(&backlog, skb); mark_bh(NET_BH); }
     • 14. There is only one backlog list in the entire system.
     • 15. do_bottom_half() calls net_bh()
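
     The split between the interrupt handler (queue the packet, mark the bottom half) and net_bh() (drain the queue later) can be sketched as a single FIFO plus a pending flag. This is a user-space model under stated assumptions: packets are represented by plain integers, the demo_* names are hypothetical, and the fixed capacity with drop-on-full is an illustrative choice, not the kernel's actual limit.

```c
#include <stddef.h>

#define DEMO_BACKLOG_MAX 8

/* Single system-wide receive queue, modelled as a ring buffer of packet ids. */
struct demo_backlog {
    int pkts[DEMO_BACKLOG_MAX];
    size_t head;      /* index of the oldest queued packet */
    size_t count;     /* number of queued packets */
    int bh_pending;   /* models mark_bh(NET_BH) */
};

/* "Top half": called from the driver, does as little as possible. */
int demo_netif_rx(struct demo_backlog *q, int pkt)
{
    if (q->count == DEMO_BACKLOG_MAX)
        return -1;  /* queue full: drop the packet (illustrative policy) */

    q->pkts[(q->head + q->count) % DEMO_BACKLOG_MAX] = pkt;
    q->count++;
    q->bh_pending = 1;
    return 0;
}

/* "Bottom half": dequeues packets in arrival order into out[].
 * Returns the number of packets handed to the protocol layer. */
int demo_net_bh(struct demo_backlog *q, int *out, size_t max_out)
{
    size_t n = 0;

    while (q->count > 0 && n < max_out) {
        out[n++] = q->pkts[q->head];
        q->head = (q->head + 1) % DEMO_BACKLOG_MAX;
        q->count--;
    }
    if (q->count == 0)
        q->bh_pending = 0;
    return (int)n;
}

/* Example: three packets arrive, then the bottom half runs once.
 * Returns 1 if they come out in FIFO order and the flag is cleared. */
int demo_backlog_example(void)
{
    struct demo_backlog q = { {0}, 0, 0, 0 };
    int out[4];
    int n;

    demo_netif_rx(&q, 10);
    demo_netif_rx(&q, 11);
    demo_netif_rx(&q, 12);
    n = demo_net_bh(&q, out, 4);

    return n == 3 && out[0] == 10 && out[1] == 11 && out[2] == 12 &&
           q.bh_pending == 0;
}
```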

  13. Getting the data from A to B
     • 10. net_bh() { …
             skb = skb_dequeue(&backlog);
             /* Bump the pointer to the next structure. skb->data and skb->nh.raw point to the MAC and encapsulated data */
             skb->h.raw = skb->nh.raw = skb->data;
             /* Fetch the packet protocol ID. */
             type = skb->protocol;
             /* We got a packet ID. Now loop over the "known protocols" list. There are two lists. The ptype_all list of taps (normally empty) and the main protocol list which is hashed perfectly for normal protocols. */ …
             if (ptype->type == type && (ptype->dev==skb->dev))
             { /* We already have a match queued. Deliver to it */
                 skb2 = skb_clone(skb, GFP_ATOMIC);
                 pt_prev->func(skb2, skb->dev, pt_prev); … }
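
     The protocol lookup in net_bh() amounts to matching the packet's protocol ID (and device) against a list of registered handlers. The sketch below models only that matching step. The demo_* types are hypothetical, the device wildcard value -1 is an assumption of this example, and delivery stops at the first match instead of cloning the buffer for every matching handler as the kernel loop does. The values 0x0800 and 0x0806 are the standard EtherTypes for IPv4 and ARP.

```c
#include <stddef.h>

struct demo_packet {
    unsigned short protocol;  /* EtherType, host byte order in this model */
    int dev;                  /* receiving device index */
};

struct demo_packet_type {
    unsigned short type;      /* protocol ID this handler accepts */
    int dev;                  /* device to match, or -1 for any (assumption) */
    int (*func)(const struct demo_packet *pkt);
};

/* Stand-in receive routines; the return value identifies which one ran. */
static int demo_ip_rcv(const struct demo_packet *pkt)
{
    (void)pkt;
    return 1;
}

static int demo_arp_rcv(const struct demo_packet *pkt)
{
    (void)pkt;
    return 2;
}

static const struct demo_packet_type demo_ptypes[] = {
    { 0x0800, -1, demo_ip_rcv },   /* IPv4 */
    { 0x0806, -1, demo_arp_rcv },  /* ARP */
};

/* Returns the handler's result, or 0 if no handler is registered (packet dropped). */
int demo_deliver(const struct demo_packet *pkt)
{
    size_t i;

    for (i = 0; i < sizeof demo_ptypes / sizeof demo_ptypes[0]; i++) {
        const struct demo_packet_type *pt = &demo_ptypes[i];

        if (pt->type == pkt->protocol && (pt->dev == -1 || pt->dev == pkt->dev))
            return pt->func(pkt);
    }
    return 0;
}

/* Example helper: deliver a packet with the given protocol ID on device 0. */
int demo_deliver_proto(unsigned short protocol)
{
    struct demo_packet pkt = { protocol, 0 };

    return demo_deliver(&pkt);
}
```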

  14. Getting the data from A to B
     • 10. Call ip_rcv() { …
             /* check the header for correctness and deal with all the IP options. ip_forward() and ip_defrag() */ …
             return skb->dst->input(skb); }
     • 11. ip_local_deliver() { …
             /* Reassemble IP fragments. */
             skb = ip_defrag(skb);
             /* Deliver to raw sockets. This is fun as to avoid copies we want to make no surplus copies. */ …
             /* Pass on the datagram to each protocol that wants it, based on the datagram protocol. */ …
             ipprot->handler(skb2, ntohs(iph->tot_len) - (iph->ihl * 4)); … }
     • 12. tcp_v4_rcv(), udp_rcv(), icmp_rcv()
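
     The length passed to the transport handler, ntohs(iph->tot_len) - (iph->ihl * 4), is the IP total length minus the header length, where ihl counts 32-bit words. A small worked sketch on raw header bytes follows. The function names are hypothetical and the sanity checks are an addition of this example. The field offsets follow the standard IPv4 header layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the payload length of an IPv4 datagram, or -1 if the header is malformed.
 * hdr points at the first byte of the IP header; avail is the number of readable bytes. */
int demo_ip_payload_len(const uint8_t *hdr, size_t avail)
{
    unsigned version, ihl, tot_len;

    if (avail < 20)
        return -1;

    version = hdr[0] >> 4;                          /* high nibble of byte 0 */
    ihl = hdr[0] & 0x0f;                            /* header length in 32-bit words */
    tot_len = ((unsigned)hdr[2] << 8) | hdr[3];     /* bytes 2-3, network byte order */

    if (version != 4 || ihl < 5 || tot_len < ihl * 4)
        return -1;
    return (int)(tot_len - ihl * 4);
}

/* Protocol number used to pick the transport handler (6 = TCP, 17 = UDP, 1 = ICMP). */
uint8_t demo_ip_protocol(const uint8_t *hdr)
{
    return hdr[9];
}

/* Example: 20-byte header (ihl = 5), total length 60, protocol TCP.
 * The payload handed to the TCP handler is 60 - 5 * 4 = 40 bytes. */
int demo_ip_example(void)
{
    uint8_t hdr[20] = {0};

    hdr[0] = 0x45;  /* version 4, ihl 5 */
    hdr[2] = 0x00;
    hdr[3] = 0x3c;  /* total length 60 */
    hdr[9] = 6;     /* TCP */

    return demo_ip_payload_len(hdr, sizeof hdr);
}
```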

  15. Getting the data from A to B
     • 13. tcp_v4_rcv() { …
             /* check the header for correctness */ …
             if (!atomic_read(&sk->sock_readers))
                 return tcp_v4_do_rcv(sk, skb);
             __skb_queue_tail(&sk->back_log, skb);
             do_time_wait: case TCP_TW_ACK: tcp_v4_send_ack();
         … }
     • 14. tcp_v4_do_rcv()
         { … __skb_queue_tail(&nsk->back_log, skb);
             if (sk->state == TCP_ESTABLISHED) { /* Fast path */
                 if (tcp_rcv_established(sk, skb, skb->h.th, skb->len))
                     goto reset;
                 return 0; }
             tcp_rcv_state_process(sk, skb, skb->h.th, skb->len); … }

  16. Getting the data from A to B
     • 15. TCP receive function for the ESTABLISHED state.
     • It is split into a fast path and a slow path. The fast path is disabled when:
         - A zero window was announced from us (zero window probing is only handled properly in the slow path).
         - Out of order segments arrived.
         - Urgent data is expected.
         - There is no buffer space left.
         - Unexpected TCP flags/window values/header lengths are received (detected by checking the TCP header against pred_flags).
         - Data is sent in both directions. The fast path only supports pure senders or pure receivers (this means either the sequence number or the ack value must stay constant).
     • When the fast path cannot be used, processing drops into a standard receive procedure patterned after RFC793 to handle all cases.
     • The first three cases are guaranteed by proper pred_flags setting, the rest is checked inline. Fast processing is turned on in tcp_data_queue when everything is OK.
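
     The fast-path decision can be read as a single predicate over the connection state and the incoming segment. The sketch below spells the listed conditions out as explicit fields. This is an assumption made for readability: as the slide notes, the kernel folds several of them into one comparison against pred_flags. All demo_* names are hypothetical, and the "data in both directions" condition is left out of the model.

```c
/* Receiver-side state relevant to the fast-path decision (simplified). */
struct demo_rx_state {
    int zero_window_announced;  /* we advertised a zero window */
    int out_of_order_queued;    /* out-of-order segments are waiting */
    int urgent_expected;        /* urgent data is pending */
    unsigned rcv_nxt;           /* next sequence number expected */
    unsigned free_buffer;       /* bytes of receive buffer left */
};

/* Incoming segment summary (simplified). */
struct demo_segment {
    unsigned seq;               /* sequence number of the first payload byte */
    unsigned len;               /* payload length in bytes */
    int unexpected_flags;       /* header differs from the predicted flags/window/length */
};

/* Returns 1 if the segment may take the fast path, 0 if the slow path is required. */
int demo_use_fast_path(const struct demo_rx_state *st, const struct demo_segment *seg)
{
    if (st->zero_window_announced || st->out_of_order_queued || st->urgent_expected)
        return 0;
    if (seg->unexpected_flags)
        return 0;
    if (seg->seq != st->rcv_nxt)      /* segment is not the next one in order */
        return 0;
    if (seg->len > st->free_buffer)   /* no room to queue the data */
        return 0;
    return 1;
}

/* Example: an in-order segment takes the fast path (returns 1) unless it
 * arrives with a sequence offset, which forces the slow path (returns 0). */
int demo_fast_path_example(unsigned seq_offset)
{
    struct demo_rx_state st = { 0, 0, 0, 1000, 4096 };
    struct demo_segment seg = { 1000 + seq_offset, 512, 0 };

    return demo_use_fast_path(&st, &seg);
}
```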

  17. Getting the data from A to B
     • 16. tcp_data() enters the sk_buff into the socket's receive list.
     • 17. data_ready() wakes up the waiting processes.
     • 18. The preceding steps are carried out in the kernel, outside of any process.
     • 19. B executes read(socket, data, len).
     • 20. The call path is sys_read() → sock_read() → inet_recvmsg() → tcp_recvmsg().
     • 21. This completes the data's travels from process A to process B.
     • 22. The data is copied only four times:
         1) From the user space of A to kernel memory
         2) From kernel memory to the network card
         3) From the network card to the other computer's kernel memory
         4) From B's kernel memory to B's user space
