
Summer Research Fellowship Programme of India's Science Academies

Understanding the concepts and mechanisms of RDMA

Lakshmi Kittur

K L S Gogte Institute of Technology, Belagavi 590008

Guided by:

Dr. J Lakshmi

Indian Institute of Science, Bengaluru 560012

Abstract

The speedy transfer of data between applications is critical so that information can be handled efficiently. Remote Direct Memory Access (RDMA) helps in boosting the efficiency of data centers as it provides low latency and high throughput. RDMA was found earlier only in Infiniband fabrics, but with the advent of RDMA over Converged Ethernet (RoCE) and iWARP, the merits of RDMA are also available to Ethernet based data centers. The software implementation of RoCE is known as Soft-RoCE.

The benefits of RDMA have helped in the development of an open source software system called INFINISWAP. INFINISWAP is a remote paging system designed specifically for remote direct memory access (RDMA) networks that enables unmodified applications to access memory from a remote pool without requiring any changes to the underlying operating system or the hardware.

Hence in this study, an attempt is made to explore and understand various concepts and mechanisms related to RDMA. Also, to know how Soft-RoCE can be used to perform RDMA operations, an RDMA client-server program is implemented. Further, as a case study to understand the benefits of RDMA, a summary of INFINISWAP is presented.

Keywords: Memory disaggregation, RDMA, memory intensive applications, INFINISWAP, Soft-RoCE

Abbreviations

RDMA - Remote Direct Memory Access
RoCE - RDMA over Converged Ethernet
iWARP - Internet Wide Area RDMA Protocol
NIC - Network Interface Card
HCA - Host Channel Adapter
API - Application Programming Interface
QP - Queue Pair
WQE - Work Queue Element
CQE - Completion Queue Element
HPC - High Performance Computing
IETF - Internet Engineering Task Force

INTRODUCTION

Background/Rationale

Day by day, larger and larger amounts of data are being generated. The applications that process this data and provide services to end users require a high-performance system stack to sustain their SLAs. In a distributed application framework, data is often copied across nodes, so fast data movement is critical. Remote Direct Memory Access (RDMA) is the direct access of the memory of one machine from the memory of another machine. RDMA boosts the performance of applications that need low latency and high throughput because it supports kernel bypass and zero-copy transfers and does not involve the CPU in the data path. Using RDMA has the following advantages:

1. Kernel bypass: Because the operating system is not involved in data transfers, applications can move data directly from user space, reducing context switching and latency.

2. Zero-copy: Applications can place data directly into the destination application's memory buffer and receive data directly into their own buffers without copying the data between network layers, eliminating unnecessary buffer copies.

3. No CPU involvement: Applications can access remote memory without using the remote CPU. With a traditional Network Interface Card (NIC), the CPU has to move data on and off the network, but with RDMA the CPU is needed only for signalling setup and completion; data is placed directly into remote memory.

Thus RDMA helps in increasing throughput and decreasing latency. RDMA suits applications that need either low latency (e.g. high performance computing, HPC) or high bandwidth (e.g. cloud computing, HPC).

Fig 1: RDMA protocol stack (source: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/)

How does RDMA work?

In order to use RDMA, a Network Interface Card (NIC) that implements an RDMA engine is required; such a NIC is called a Host Channel Adapter (HCA). The HCA creates a channel from the RDMA engine to the application memory, as shown in Fig 1. All the logic required to carry out the RDMA protocol over the wire is embedded in the HCA hardware. As depicted in Fig 1, the RDMA kernel module is used to establish the command channel, and the "verbs" Application Programming Interface (API) is used to establish the data channels.

Before performing RDMA operations, the application needs to pin memory, telling the kernel that this memory is reserved for the application's RDMA communication; the HCA then creates a channel from the NIC to this memory. This process is known as registering memory.

Fig 2: Queue Pair (source: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/)

As shown in Fig 2, RDMA communications use three types of queues, namely: the Send Queue, the Receive Queue and the Completion Queue.

The send and receive queues are always created as a pair, called a Queue Pair (QP), and are responsible for scheduling work. The Completion Queue (CQ) is created to notify the application when instructions on the work queues have completed. The instructions placed on a work queue tell the HCA which buffer it should send or receive; these instructions are called Work Queue Elements (WQEs). A WQE placed on the send queue contains a pointer to the buffer to be sent, whereas a WQE placed on the receive queue contains a pointer to a buffer that will hold the incoming message. Once a transaction completes, a Completion Queue Element (CQE) is placed on the Completion Queue.
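To make the WQE/CQE flow concrete, here is a toy model in plain C (illustrative only; these are simplified stand-ins, not the verbs API): work requests are posted to a send queue as pointers to buffers, a stand-in for the HCA marks them complete, and completions are then polled from the completion queue.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of RDMA queues; a real HCA consumes WQEs asynchronously. */
typedef struct { const void *buf; size_t len; } wqe_t;  /* work queue element */
typedef struct { int wr_id; int status; } cqe_t;        /* completion element */

#define QDEPTH 8
static wqe_t send_q[QDEPTH]; static int send_head;
static cqe_t comp_q[QDEPTH]; static int comp_head;

/* Post a send WQE: only a pointer to the buffer is queued, no data copy. */
static int post_send(const void *buf, size_t len) {
    send_q[send_head] = (wqe_t){ buf, len };
    return send_head++;                 /* work request id */
}

/* Pretend the HCA processed one WQE and signalled completion. */
static void hca_complete(int wr_id) {
    comp_q[comp_head++] = (cqe_t){ wr_id, 0 /* success */ };
}

/* Poll the completion queue; returns 1 if a CQE was retrieved. */
static int poll_cq(cqe_t *out) {
    if (comp_head == 0) return 0;
    *out = comp_q[--comp_head];
    return 1;
}
```

In the real API the analogous calls are ibv_post_send, ibv_post_recv and ibv_poll_cq.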

      REALIZATION OF RDMA

RDMA is supported by Infiniband fabrics and by two Ethernet-based protocols, namely RDMA over Converged Ethernet (RoCE) and iWARP.

1. Infiniband: A networking protocol with native RDMA support. It needs hardware support, i.e. switches and NICs that implement the technology.

2. RoCE: A networking protocol that allows RDMA over an Ethernet network. RoCE does not require data centres running on Ethernet to convert to Infiniband. There are two versions, RoCE v1 and RoCE v2. RoCE v1 allows communication between two hosts in the same Ethernet broadcast domain, whereas RoCE v2 packets can be routed. Hence, RoCE v1 is an Ethernet link-layer protocol and RoCE v2 is an internet-layer protocol.

The software implementation of RoCE, known as Soft-RoCE, is explained in section 4.

3. iWARP: An Internet Engineering Task Force (IETF) standard. iWARP is a networking protocol that runs on top of TCP/IP, so iWARP packets are routable.

What is Soft-RoCE?

The software implementation of RoCE is called Soft-RoCE; it brings RoCE technology to all Ethernet-enabled servers. Fig 3 depicts the implementation of the complete RDMA stack in software over an ordinary Network Interface Card (NIC). Soft-RoCE consists of the ib_rxe kernel module and the librxe user-space library. The Soft-RoCE driver is available for the Linux platform; installing Mellanox OFED 4.0 or the upstream packages automatically provides the Soft-RoCE user-space libraries and kernel modules. Soft-RoCE enables servers in data centres that have Ethernet adapters but no RDMA hardware to connect to high-performance storage units that use hardware-based RoCE, letting them deliver data efficiently [Soft-RoCE]. It allows a system with an Ethernet adapter to interoperate with another system running either hardware-based RoCE or Soft-RoCE.

Fig 3: Soft-RoCE Architecture (source: https://community.mellanox.com/docs/DOC-2184)

To understand how RoCE is implemented in software and to perform RDMA operations, Soft-RoCE was installed on Ubuntu 16.04 and a client-server RDMA program was run on it, as explained in section 5.

        RDMA Programming

In order to perform RDMA operations, a connection must be established with the remote host and appropriate permissions must be set. This is achieved using a mechanism called the Queue Pair (QP). Once the Queue Pair is established, RDMA send, receive, read and write operations can be performed using the verbs API.

        RDMA transfer types are :-

1. SEND/RECV: Similar to TCP sockets: the server must issue a listen before the client issues a connect, and every send must be matched by a receive posted on the other side. These two verbs are two-sided.

2. READ/WRITE: These verbs specify the remote memory address on which the operation is carried out, without involving the remote CPU. They are one-sided.
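The one-sided/two-sided distinction can be illustrated with a toy model in plain C (illustrative stand-ins, not the verbs API): a two-sided send completes only if the peer has posted a receive buffer, while a one-sided write lands directly at an address the peer advertised, without involving its CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model: "remote" memory is simply a buffer in this process. */
static char remote_mem[64];
static char *posted_recv;            /* receive buffer posted by the peer */

static void post_recv(char *buf) { posted_recv = buf; }

/* Two-sided SEND: fails unless the peer has posted a matching receive. */
static int rdma_send_toy(const char *data, size_t len) {
    if (!posted_recv) return -1;     /* no receive posted: send cannot complete */
    memcpy(posted_recv, data, len);
    posted_recv = NULL;              /* each receive consumes one send */
    return 0;
}

/* One-sided WRITE: needs only the advertised remote address (and, in real
 * RDMA, the rkey obtained during memory registration). */
static int rdma_write_toy(uintptr_t remote_addr, const char *data, size_t len) {
    memcpy((void *)remote_addr, data, len);
    return 0;
}
```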

Below is the flow of a sample RDMA program that I have implemented, wherein the client sends 1 million numbers to the server and the server replies by scaling those numbers by a factor of 2.

        The program flow on Server Side is as listed below :-

        1. Create an event channel.

        2. Bind to an address.

        3. Create a listener to listen at the port for connection request.

        4. Create a protection domain, completion queue, and send-receive queue pair.

5. Post a receive before accepting the connection.

        6. Accept the connection request.

        7. Wait for the receive completion.

        8. Send results back to client.

        The program flow on Client Side is as listed below:-

        1. Create an event channel.

        2. Resolve the peer’s address.

        3. Resolve the route to the peer.

        4. Create a protection domain, completion queue, and send-receive queue pair.

        5. Connect to server.

6. Wait for the connection to be established.

7. Pre-post receives.

8. Send 1 million numbers to the server.

9. Wait for receive completion and receive the reply.
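Because the numbers travel in network byte order, the scaling step the server performs (Appendix 1) converts each value to host order, doubles it, and converts it back. In isolation:

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Double each element of a buffer whose values are in network byte order,
 * as the server in Appendix 1 does before sending its reply. */
static void scale_by_two(uint32_t *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        buf[i] = htonl(ntohl(buf[i]) * 2);
}
```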

        The code ​RDMA program​ is presented in Appendix 1.

        INFINISWAP AND MEMORY DISAGGREGATION

For a long time there have been attempts to borrow memory over the network to reduce the latency of accessing data on disk, but such transfers were limited by the network itself. Nowadays many data centres use RDMA, earlier found only in high performance computing, for fast data transfers. Taking advantage of RDMA, researchers at the University of Michigan have developed a software system called Infiniswap that provides an efficient way to share memory between servers in a cluster. This section contains a summary of the ​Infiniswap paper​.

        Introduction

Memory-intensive applications are widely used because they provide low-latency services, which require user requests to be serviced through in-memory processing with minimal next-level (e.g. disk) data accesses. But as the data to process grows, the entire workload can no longer fit in a single machine's memory. The problem of fitting the workload in memory can be handled in two ways: over-allocate memory, or right-size the allocation. However, both may lead to memory underutilization and memory imbalance in the cluster. For instance, previous analyses of two large production clusters (Facebook and Google) have shown an imbalance in memory utilization across their machines for more than 70% of the time.

The issue of memory imbalance in a cluster can be solved by memory disaggregation, which pools the unused memory of several servers and presents it as a single pool to all the applications in the cluster. A recent development in this regard is an open source software system called INFINISWAP. INFINISWAP implements cluster-wide memory disaggregation, designed specifically for RDMA networks, wherein the memory of the cluster's servers is exposed as a single pool to applications in the cluster. This is achieved without any significant changes to the operating system, applications or hardware.

Whenever servers in the cluster run out of memory, INFINISWAP lets them use the remote memory of other servers rather than writing to slower storage such as disks. To find a remote machine with free memory, INFINISWAP makes use of the "power of two choices": among the sampled machines, the one with the least memory used is selected.
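A minimal sketch of the "power of two choices" selection (illustrative only, with hypothetical names; not INFINISWAP's actual code): sample two distinct machines uniformly at random and map the slab to the one with less memory in use.

```c
#include <assert.h>
#include <stdlib.h>

/* Power of two choices: sample two distinct machines at random and
 * return the index of the one with less memory in use.
 * Assumes n_machines >= 2. */
static int pick_remote(const unsigned *mem_used, int n_machines) {
    int a = rand() % n_machines;
    int b = rand() % n_machines;
    while (b == a)                   /* ensure two distinct candidates */
        b = rand() % n_machines;
    return mem_used[a] <= mem_used[b] ? a : b;
}
```

Sampling only two candidates avoids querying every machine while still strongly favouring lightly loaded ones.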

        INFINISWAP Architecture

INFINISWAP addresses the concerns of high CPU overheads, failures of remote memory, and the problem of finding remote machines with free memory through two primary components, a block device and a daemon, which run on every machine; no central coordinator is required to track the memory status of the cluster's servers.

The INFINISWAP block device acts as a swap space. Its entire address space is divided into fixed-size slabs that are distributed across the memory of remote machines. From Fig 4 it can be seen that the block device exposes an I/O interface to the virtual memory manager (VMM), writing synchronously to remote memory and asynchronously to disk for fault tolerance. In INFINISWAP, the slab is the unit of load balancing and remote mapping: slabs from the same device are mapped to several remote machines' memory for load balancing, and all pages belonging to the same slab are mapped to the same remote machine. For a page-out request, if the page's slab is mapped to remote memory, the block device writes the page to remote memory synchronously using RDMA WRITE and to the local disk asynchronously; if the slab is not mapped, the block device writes the page to the local disk synchronously. For page-in requests, the block device uses RDMA READ to read the page from the appropriate source.
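The page-out and page-in routing just described can be sketched as a toy decision routine (plain C with hypothetical names; the real block device issues RDMA verbs and block I/O rather than returning enum values):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-slab mapping state: is this slab backed by a remote machine? */
typedef struct { bool mapped_remote; } slab_t;

/* Possible destinations for page-out and page-in requests. */
typedef enum { SYNC_REMOTE_ASYNC_DISK, SYNC_DISK_ONLY } pageout_action_t;
typedef enum { RDMA_READ_REMOTE, DISK_READ } pagein_action_t;

/* Page-out policy: mapped slab -> RDMA WRITE to remote memory (sync)
 * plus disk write (async); unmapped slab -> local disk only (sync). */
static pageout_action_t route_pageout(const slab_t *slab) {
    return slab->mapped_remote ? SYNC_REMOTE_ASYNC_DISK : SYNC_DISK_ONLY;
}

/* Page-in policy: read the page back from wherever it lives. */
static pagein_action_t route_pagein(const slab_t *slab) {
    return slab->mapped_remote ? RDMA_READ_REMOTE : DISK_READ;
}
```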

The INFINISWAP daemon runs in user space and manages remote memory, i.e. the allocation of slabs. The daemon participates in control plane activities such as handling slab-mapping requests from block devices and evicting slabs whenever it observes a performance loss in local applications. All these control plane activities are performed using RDMA SEND/RECV.

Fig 4: INFINISWAP Architecture (source: ​Infiniswap paper​)

          Performance benefits with INFINISWAP

          The effectiveness of INFINISWAP is evaluated by using four memory-intensive applications and a combination of workloads namely:

          1. TPC-C benchmark on VoltDB

          2. Facebook workloads on Memcached

          3. Twitter graph on PowerGraph

          4. Twitter data on GraphX

Fig 5: Performance of various applications for different memory workloads when paging to disk (source: ​Infiniswap paper​)

As the memory allocated for a workload decreases, performance usually degrades. Fig 5 shows the throughput and latency of the above-mentioned applications on a variety of workloads when data not found in memory is accessed from disk, with 100%, 75% and 50% of the working set in memory. All plots show single-machine performance and the median value over 5 runs. For the latency plots (lines), lower is better; for the throughput plots (bars), higher is better.

Fig 6: INFINISWAP performance for the same applications as in Fig 5 (source: ​Infiniswap paper​)

Fig 6 shows the performance of the same applications and workloads when data not found in local memory is accessed from remote memory (i.e. using INFINISWAP), again with 100%, 75% and 50% of the working set in memory, reporting single-machine performance and the median over 5 runs.

Comparison between the graphs in Fig 5 and Fig 6:

1. VoltDB: VoltDB is an in-memory transactional database that is ACID compliant. Fig 6(a) shows that, unlike with disk (Fig 5(a)), INFINISWAP's performance does not degrade super-linearly but only linearly as a smaller fraction of the workload fits in memory.

2. Memcached: Memcached is a distributed caching system that keeps data and objects in RAM to reduce the number of external disk accesses, thus speeding up dynamic database-driven websites. Fig 6(b) shows that, unlike with disk (Fig 5(b)), INFINISWAP's performance remains steady rather than degrading linearly or super-linearly as a smaller fraction of the workload fits in memory.

3. PowerGraph: PowerGraph is a framework for large-scale machine learning and graph computation (​Infiniswap paper​). Fig 6(c) shows that, unlike with disk (Fig 5(c)), INFINISWAP's performance remains stable.

4. GraphX: GraphX is a specialized graph processing system built on top of the Apache Spark in-memory analytics engine (​Infiniswap paper​). Fig 6(d) shows that, compared to paging to disk with 50% of the working set in memory, INFINISWAP gives a 2x performance improvement.

              CONCLUSION

This study showcases various concepts related to RDMA, such as how RDMA works and the implementations that support it. Executing a sample RDMA client-server program using Soft-RoCE gives better insight into the various RDMA operations. Further, the case study of INFINISWAP shows how RDMA can be used to access the remote memory of servers in a cluster quickly, rather than going to disk, and thereby improve the performance of a range of applications.

              ACKNOWLEDGEMENTS

First and foremost, I would like to express my deep sense of gratitude to the Indian Academy of Sciences (IASc-INSA-NASI) for giving me the opportunity to carry out this project under the Summer Research Fellowship 2018.

I owe my heartiest gratitude to Dr. J Lakshmi, Principal Research Scientist, SERC, Indian Institute of Science, Bangalore, who spared time to guide and encourage me, kept me on the correct path, and allowed me to carry out my project at this esteemed institution.

I thank Mr. Anubhav Guleria for his continuous guidance throughout my project.

              It is my privilege to express my regards to Dr. Vijay S. Rajpurohit for providing my letter of recommendation to Indian Academy of Sciences.

              I would also like to thank my institute K L S Gogte Institute Of Technology for allowing me to pursue my internship.

I would like to express my deepest gratitude towards my beloved parents, fellow interns and friends who have supported me with their valuable suggestions and guidance.

              APPENDIX 1

Below is a sample RDMA program that I have implemented, wherein the client sends 1 million numbers to the server and the server replies by scaling those numbers by a factor of 2.

              Client Side :-

              #include <stdio.h>

              #include <stdlib.h>

              #include <stdint.h>

              #include <string.h>

              #include <sys/types.h>

              #include <sys/socket.h>

              #include <netdb.h>

              #include <arpa/inet.h>

              #include <rdma/rdma_cma.h>

              enum {

              RESOLVE_TIMEOUT_MS = 5000,

              };

              struct pdata {

              uint64_t    buf_va;

              uint32_t    buf_rkey;

              };

              /*declaration*/

              struct pdata                    server_pdata;

               struct rdma_event_channel       *cm_channel;

               struct rdma_cm_id               *cm_id;

               struct rdma_cm_event                *event;

               struct rdma_conn_param          conn_param = { };

               struct ibv_pd                   *pd;

               struct ibv_comp_channel         *comp_chan;

               struct ibv_cq                   *cq;

               struct ibv_cq                   *evt_cq;

               struct ibv_mr                   *mr;

               struct ibv_qp_init_attr         qp_attr = { };

               struct ibv_sge                  sge;

               struct ibv_send_wr              send_wr = { };

               struct ibv_send_wr              *bad_send_wr;

               struct ibv_recv_wr              recv_wr = { };

               struct ibv_recv_wr              *bad_recv_wr;

               struct ibv_wc                   wc;

               void                            *cq_context;

               struct addrinfo                 *res, *t;

               struct addrinfo                 hints = {

                   .ai_family = AF_INET,

                   .ai_socktype = SOCK_STREAM

                };

                  int                             n;

                  int                     *buf;

                  int                             err;

              int pre_conn(char *argv)

              {

                  /*create event channel*/

                  cm_channel = rdma_create_event_channel();

                  if (!cm_channel)

                      return 1;

                  err = rdma_create_id(cm_channel, &cm_id, NULL, RDMA_PS_TCP);

                  if (err)

                      return err;

                  n = getaddrinfo(argv, "20069", &hints, &res);

                  if (n < 0)

                      return 1;

                  

                  /*resolve the address and route*/

                  for (t = res; t; t = t->ai_next) {

                      err = rdma_resolve_addr(cm_id, NULL, t->ai_addr, RESOLVE_TIMEOUT_MS);

                      if (!err)

                          break;

                  }

                  if (err)

                      return err;

                  err = rdma_get_cm_event(cm_channel, &event);

                  if (err)

                      return err;

                  if (event->event != RDMA_CM_EVENT_ADDR_RESOLVED)

                      return 1;

                  rdma_ack_cm_event(event);

                  err = rdma_resolve_route(cm_id, RESOLVE_TIMEOUT_MS);

                  if (err)

                      return err;

                  err = rdma_get_cm_event(cm_channel, &event);

                  if (err)

                      return err;

                  if (event->event != RDMA_CM_EVENT_ROUTE_RESOLVED)

                      return 1;

                  rdma_ack_cm_event(event);

                  

                  /*allocate protection domain*/

                  pd = ibv_alloc_pd(cm_id->verbs);

                  if (!pd)

                      return 1;

                  comp_chan = ibv_create_comp_channel(cm_id->verbs);

                  if (!comp_chan)

                      return 1;

                  cq = ibv_create_cq(cm_id->verbs, 2,NULL, comp_chan, 0);

                  if (!cq)

                      return 1;

                  if (ibv_req_notify_cq(cq, 0))

                      return 1;

                  

                  if (!buf)

                      return 1;

                  mr = ibv_reg_mr(pd, buf,1000000* sizeof(int), IBV_ACCESS_LOCAL_WRITE|IBV_ACCESS_REMOTE_READ|IBV_ACCESS_REMOTE_WRITE);

                  if (!mr)

                      return 1;

                  qp_attr.cap.max_send_wr = 1;

                  qp_attr.cap.max_send_sge = 1;

                  qp_attr.cap.max_recv_wr = 1;

                  qp_attr.cap.max_recv_sge = 1;

                  qp_attr.send_cq = cq;

                  qp_attr.recv_cq = cq;

                  qp_attr.qp_type = IBV_QPT_RC;

                  err = rdma_create_qp(cm_id, pd, &qp_attr);

                  if (err)

                      return err;

                  conn_param.initiator_depth = 1;

                  conn_param.retry_count = 7;

    return 0;

              }

              /*connecting to server*/

              int conn_send_data()

              {

              err = rdma_connect(cm_id, &conn_param);

                  if (err)

                                  return err;

                  err = rdma_get_cm_event(cm_channel,&event);

                  if (err)

                                  return err;

                  if (event->event != RDMA_CM_EVENT_ESTABLISHED)

                                  return 1;

                  memcpy(&server_pdata, event->param.conn.private_data, sizeof server_pdata);

                  rdma_ack_cm_event(event);

                  /* Prepost*/

                  sge.addr = (uintptr_t) buf;

                  sge.length =1000000* sizeof (int);

                  sge.lkey = mr->lkey;

                  recv_wr.wr_id = 0;

                  recv_wr.sg_list = &sge;

                  recv_wr.num_sge = 1;

                  if (ibv_post_recv(cm_id->qp, &recv_wr, &bad_recv_wr))

                                  return 1;

                  

                  for(int i=0;i<1000000;i++)

                  buf[i]=htonl(buf[i]);

                  sge.addr                     = (uintptr_t) buf;

                  sge.length =1000000* sizeof (int);

                  sge.lkey = mr->lkey;

                  send_wr.wr_id = 1;

                  send_wr.opcode = IBV_WR_SEND;

                  send_wr.send_flags   =IBV_SEND_SIGNALED;

                  send_wr.sg_list = &sge;

                  send_wr.num_sge = 1;

                  send_wr.wr.rdma.rkey = ntohl(server_pdata.buf_rkey);

                  send_wr.wr.rdma.remote_addr = ntohl(server_pdata.buf_va);

                  

                  if (ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr))

                      return 1;

    return 0;
}

              /*receive reply from server*/

              int recv_answer()

              {

                  printf("\n\nThe numbers received after scaling by factor of 2 are\n");

                  while (1) {

                      if (ibv_get_cq_event(comp_chan,&evt_cq, &cq_context))

                          return 1;

                      if (ibv_req_notify_cq(cq, 0))

                          return 1;

                      if (ibv_poll_cq(cq, 1, &wc) != 1)

                          return 1;

                      if (wc.status != IBV_WC_SUCCESS)

                          return 1;

                      if (wc.wr_id == 0) {

                      for(int i=0;i<1000000;i++)  

                      printf("%d ", ntohl(buf[i]));

                          return 0;

                      }

              }

              return 0;

              }

              int main(int argc, char *argv[ ])

              {

                  int *arr;

                  int err;

    arr = (int *)malloc(1000000 * sizeof(int));

                  printf("The numbers sent to the server are\n ");

                  for(int i=0;i<1000000;i++){

                  arr[i]=rand()%100;

                  printf("%d ",arr[i]);

                  }   

                  buf=arr;

                  

                  err=pre_conn(argv[1]);

                  if(!err)

                  {

                      err=conn_send_data();

                      if(!err)

                      {

                          err=recv_answer();

                      }    

                  }

              }

              Server Side :-

              #include <stdlib.h>

              #include <stdint.h>

              #include <arpa/inet.h>

#include <stdio.h>

              #include <infiniband/arch.h>

              #include <rdma/rdma_cma.h>

              /*declarations*/

              enum {

              RESOLVE_TIMEOUT_MS = 5000,

              };

              struct pdata {

              uint64_t buf_va;

              uint32_t buf_rkey;

              };

              struct pdata rep_pdata;

              struct rdma_event_channel *cm_channel;

              struct rdma_cm_id *listen_id;

              struct rdma_cm_id *cm_id;

              struct rdma_cm_event *event;

              struct rdma_conn_param conn_param = { };

              struct ibv_pd *pd;

              struct ibv_comp_channel *comp_chan;

              struct ibv_cq *cq;

              struct ibv_cq *evt_cq;

              struct ibv_mr *mr;

              struct ibv_qp_init_attr qp_attr = { };

              struct ibv_sge sge;

              struct ibv_send_wr send_wr = { };

              struct ibv_send_wr *bad_send_wr;

              struct ibv_recv_wr recv_wr = { };

              struct ibv_recv_wr *bad_recv_wr;

              struct ibv_wc wc;

              void *cq_context;

              struct sockaddr_in sin1;

              int *buf;

              int err;

              /* Set up RDMA CM structures */

/*returns 0 on success, 1 on failure*/

              int set_rdma_cm_str()

              {

              cm_channel = rdma_create_event_channel();

              if (!cm_channel)

              return 1;

              err = rdma_create_id(cm_channel, &listen_id, NULL, RDMA_PS_TCP);

              if (err)

                   return err;

              sin1.sin_family = AF_INET;

              sin1.sin_port = htons(20069);

              sin1.sin_addr.s_addr = INADDR_ANY;

                  return 0;

              }

              /*binding socket and listening at the port for connection request*/

/*returns 0 on success, 1 on failure*/

              int bindsock()

              {

                  

              err = rdma_bind_addr(listen_id, (struct sockaddr *) &sin1);

              if (err)

              return err;

              err = rdma_listen(listen_id, 1);

              if (err)

              return err;

              err = rdma_get_cm_event(cm_channel, &event);

              if (err)

              return err;

              if (event->event != RDMA_CM_EVENT_CONNECT_REQUEST)

              return 1;

              cm_id = event->id;

              rdma_ack_cm_event(event);

                  return 0;

              }

              /* Create verbs objects now that we know which device to use */

              int create_verb_obj()

              {   

              pd = ibv_alloc_pd(cm_id->verbs);

              if (!pd)

              return 1;

              comp_chan = ibv_create_comp_channel(cm_id->verbs);

              if (!comp_chan)

              return 1;

              cq = ibv_create_cq(cm_id->verbs, 2, NULL, comp_chan, 0);

              if (!cq)

              return 1;

              if (ibv_req_notify_cq(cq, 0))

              return 1;

              buf = calloc(1000000, sizeof (int));

              if (!buf)

              return 1;

              mr = ibv_reg_mr(pd, buf, 1000000 * sizeof (int),

              IBV_ACCESS_LOCAL_WRITE |

              IBV_ACCESS_REMOTE_READ |

              IBV_ACCESS_REMOTE_WRITE);

              if (!mr)

              return 1;

              qp_attr.cap.max_send_wr = 1;

              qp_attr.cap.max_send_sge = 1;

              qp_attr.cap.max_recv_wr = 1;

              qp_attr.cap.max_recv_sge = 1;

              qp_attr.send_cq = cq;

              qp_attr.recv_cq = cq;

              qp_attr.qp_type = IBV_QPT_RC;

              err = rdma_create_qp(cm_id, pd, &qp_attr);

              if (err)

              return err;

return 0;
}

              /*Posting receive before accepting connection */

/*returns 0 on success, 1 on failure*/

              int post_recv()

              {    

              sge.addr = (uintptr_t) buf;

              sge.length = 1000000*sizeof (int);

              sge.lkey = mr->lkey;

              recv_wr.sg_list = &sge;

              recv_wr.num_sge = 1;

              if (ibv_post_recv(cm_id->qp, &recv_wr, &bad_recv_wr))

              return 1;

              rep_pdata.buf_va = htonl((uintptr_t) buf);

              rep_pdata.buf_rkey = htonl(mr->rkey);

              conn_param.responder_resources = 1;

              conn_param.private_data = &rep_pdata;

              conn_param.private_data_len = sizeof rep_pdata;

              return 0;

              }

/* Accepting connection */
/* returns 0 on success, non-zero on failure */
int accept_conn()
{
    err = rdma_accept(cm_id, &conn_param);
    if (err)
        return 1;

    err = rdma_get_cm_event(cm_channel, &event);
    if (err)
        return err;

    if (event->event != RDMA_CM_EVENT_ESTABLISHED)
        return 1;

    rdma_ack_cm_event(event);
    return 0;
}

/* Wait for receive completion */
/* returns 0 on success, 1 on failure */
int wait_recv_comp()
{
    if (ibv_get_cq_event(comp_chan, &evt_cq, &cq_context))
        return 1;

    if (ibv_req_notify_cq(cq, 0))
        return 1;

    if (ibv_poll_cq(cq, 1, &wc) < 1)
        return 1;

    if (wc.status != IBV_WC_SUCCESS)
        return 1;

    return 0;
}

/* Doubling operation: convert each value to host byte order,
   scale it by 2, and convert back to network byte order */
void operation()
{
    for (int i = 0; i < 1000000; i++)
        buf[i] = htonl(ntohl(buf[i]) * 2);
}

/* Sending result and waiting for send completion */
/* returns 0 on success, 1 on failure */
int post_conn()
{
    sge.addr = (uintptr_t) buf;
    sge.length = 1000000 * sizeof(int);
    sge.lkey = mr->lkey;

    send_wr.opcode = IBV_WR_SEND;
    send_wr.send_flags = IBV_SEND_SIGNALED;
    send_wr.sg_list = &sge;
    send_wr.num_sge = 1;

    if (ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr))
        return 1;

    if (ibv_get_cq_event(comp_chan, &evt_cq, &cq_context))
        return 1;

    if (ibv_poll_cq(cq, 1, &wc) < 1)
        return 1;

    if (wc.status != IBV_WC_SUCCESS)
        return 1;

    /* Acknowledge both CQ events (receive and send) */
    ibv_ack_cq_events(cq, 2);
    return 0;
}

int main(int argc, char *argv[])
{
    if (set_rdma_cm_str()) {
        printf("Error in creating rdma cm structures\n");
        return 1;
    }
    if (bindsock()) {
        printf("Error in binding address\n");
        return 1;
    }
    if (create_verb_obj()) {
        printf("Error in creating verb objects\n");
        return 1;
    }
    if (post_recv()) {
        printf("Error in post receive\n");
        return 1;
    }
    if (accept_conn()) {
        printf("Error in accept connection\n");
        return 1;
    }
    if (wait_recv_comp()) {
        printf("Error in wait for receive completion\n");
        return 1;
    }
    operation();
    if (post_conn()) {
        printf("Error in post connection\n");
        return 1;
    }
    printf("Data sent successfully\n");
    return 0;
}

              References

              • https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/

              • http://www.roceinitiative.org/software-based-roce-a-new-way-to-experience-rdma/

              • https://community.mellanox.com/docs/DOC-2184

              • https://github.com/linzion/RDMA-example-application

              •   Gu, J., Lee, Y., Zhang, Y., Chowdhury, M. and Shin, K.G., 2017, March. Efficient Memory Disaggregation with Infiniswap. In NSDI (pp. 649-667).  
