Monday, February 12, 2018

JBOF (just a bunch of flash)

  • JBOF (just a bunch of flash)
Facebook is announcing its flexible NVMe JBOF (just a bunch of flash), Lightning.
http://www.storagereview.com/facebook_introduces_a_flexible_nvme_jbof

  • "NVMe in a shared chassis look like an internal drive – so it's not shared data. It's not a SAN. You cannot run VMFS (VMware file system) on top of that."
With NVMeF there is a remote direct memory access from a host server to a drive in an NVMeF array.
If you have direct host server access to the array's drives across NVMeF then the array is effectively controller-less, and just a bunch of flash drives (JBOF)
"A SAN? That's associated with Fibre Channel and SCSI. I would call it NVMeF. This is a SAN replacement or next-generation of SAN. It may be a bit confusing to call it a SAN."
https://www.theregister.co.uk/2017/06/23/is_an_nvme_over_fabrics_array_a_san/
  • Is an NVMe over fabrics array a SAN? "NVMe in a shared chassis look like an internal drive – so it's not shared data. It's not a SAN. You cannot run VMFS (VMware file system) on top of that." With NVMeF there is a remote direct memory access from a host server to a drive in an NVMeF array. If you have direct host server access to the array's drives across NVMeF then the array is effectively controller-less, and just a bunch of flash drives (JBOF)
    "A SAN? That's associated with Fibre Channel and SCSI. I would call it NVMeF. This is a SAN replacement or next-generation of SAN. It may be a bit confusing to call it a SAN."
     





Friday, February 9, 2018

HPC

  • The Open Science Grid (OSG) provides common service and support for resource providers and scientific institutions (i.e., "sites") using a distributed fabric of high throughput computational services. This documentation aims to provide HTC/HPC system administrators with the necessary information to contribute resources to the OSG. Contributing requires an existing compute cluster running on a supported operating system with a supported batch system: Grid Engine, HTCondor, LSF, PBS Pro/Torque, or Slurm.
https://opensciencegrid.org/docs/

    • WekaIO Matrix and Univa Grid Engine – Extending your On-Premise HPC Cluster to the Cloud
    The capacity of the file system will depend on the number of cluster hosts, and the size and number of SSDs on each host. For example, if I need 100TB of shared usable storage, I can accomplish this by using a cluster of 12 x i3.16xlarge storage-dense instances on AWS, where each instance has 8 x 1,900GB NVMe SSDs for a total of 15.2 TB of SSD storage per instance. A twelve-node cluster, in theory, would have 182.4 TB of capacity, but if we deploy a stripe size of eight (6N + 2P), approximately 25% of our capacity will be dedicated to storing parity. 10% of the Matrix file system is recommended to be held in reserve for internal housekeeping and caching, so the usable capacity of the twelve-node cluster is ~123 TB (assuming no backing store).
    https://blogs.univa.com/2019/06/wekaio-matrix-and-univa-grid-engine-extending-your-on-premise-hpc-cluster-to-the-cloud/
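    A quick back-of-the-envelope check of the numbers quoted above (a minimal sketch; the 6N+2P parity overhead and 10% reserve figures come straight from the post, the rest is simple arithmetic):

    #include <stdio.h>

    int main(void)
    {
        const double nodes        = 12.0;
        const double ssd_per_node = 8.0;
        const double ssd_size_tb  = 1.9;         /* 1,900 GB NVMe SSD */
        const double data_frac    = 6.0 / 8.0;   /* 6N + 2P stripe: ~25% parity */
        const double reserve      = 0.10;        /* 10% reserved for housekeeping */

        double raw    = nodes * ssd_per_node * ssd_size_tb;   /* 182.4 TB */
        double usable = raw * data_frac * (1.0 - reserve);    /* ~123 TB  */

        printf("raw: %.1f TB, usable: %.1f TB\n", raw, usable);
        return 0;
    }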

    • What Is High-Performance Computing?
    High-performance computing (HPC) is the ability to process data and perform complex calculations at high speeds. 
    One of the best-known types of HPC solutions is the supercomputer. A supercomputer contains thousands of compute nodes that work together to complete one or more tasks. This is called parallel processing. It’s similar to having thousands of PCs networked together, combining compute power to complete tasks faster.
    https://www.netapp.com/us/info/what-is-high-performance-computing.aspx
    • Open Source High-Performance Computing
      The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is, therefore, able to combine the expertise, technologies, and resources from all across the High-Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.
      https://www.open-mpi.org/
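      Open MPI implements the standard MPI API, so a minimal program looks the same regardless of implementation. A sketch; compile with the mpicc wrapper ("mpicc hello.c -o hello") and launch with "mpirun -np 4 ./hello":

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, size;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
          MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

          printf("Hello from rank %d of %d\n", rank, size);

          MPI_Finalize();
          return 0;
      }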

    • Introduction to high performance computing: what it is, how to use it and when
    Simply put, HPC is a multi-user, shared, smart batch-processing system. It significantly improves the scale and size of processing, with raw power and parallelization, thanks to rapid advances in low-cost microprocessors, high-speed networks and optimized software.

    https://www.slideshare.net/raamana/high-performance-computing-with-checklist-and-tips-optimal-cluster-usage
    • Some workloads have very high I/O throughput, so to make sure these requirements are met, the kernel uses schedulers.
    Schedulers do exactly what they say: schedule activities within the kernel so that system activities and resources are scheduled to achieve an overall goal for the system. This goal could be low latency for input (as embedded systems require), better interactivity, faster I/O, or even a combination of goals. Primarily, schedulers are concerned with CPU resources, but they could also consider other system resources (e.g., memory, input devices, networks, etc.).
    http://www.admin-magazine.com/HPC/Articles/Linux-I-O-Schedulers?utm_source=ADMIN+Newsletter&utm_campaign=HPC_Update_110_2018-03-22_Linux_I%2FO_Schedulers&utm_medium=email
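    The I/O scheduler in use for a block device is exposed through sysfs. A minimal sketch, assuming a device named sda (the bracketed entry in the output is the scheduler currently active):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical device name; substitute your own block device. */
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
        char line[256];

        if (!f) { perror("fopen"); return 1; }
        if (fgets(line, sizeof(line), f))
            printf("sda schedulers: %s", line);   /* e.g. "noop deadline [cfq]" */
        fclose(f);
        return 0;
    }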
    • The general goal of HPC is either to run applications faster or to run problems that can’t or won’t run on a single server. To do this, you need to run parallel applications across separate nodes. If you are interested in parallel computing using multiple nodes, you need at least two separate systems (nodes), each with its own operating system (OS). To keep things running smoothly, the OS on both nodes should be identical. (Strictly speaking, it doesn’t have to be this way, but otherwise, it is very difficult to run and maintain.) If you install a package on node 1, then it needs to be installed on node 2 as well. This lessens a source of possible problems when you have to debug the system.
    http://www.admin-magazine.com/HPC/Articles/Building-an-HPC-Cluster


    • OFED (OpenFabrics Enterprise Distribution) is a package that is developed and released by the OpenFabrics Alliance (OFA), as a joint effort of many companies that are part of the RDMA scene. It contains the latest upstream software packages (both kernel modules and userspace code) to work with RDMA. It supports the InfiniBand, Ethernet and RoCE transports.
    http://www.rdmamojo.com/2014/11/30/working-rdma-using-ofed/
    • I explained how to install the RDMA stack in several ways (inbox, OFED and manually).
    http://www.rdmamojo.com/2015/01/24/verify-rdma-working/

    • Installation of RDMA stack manually
    It is possible to install the RDMA packages in a convenient and simple way using the OFED distribution, or using the packages shipped with the Linux distributions.
    http://www.rdmamojo.com/2014/12/27/installation-rdma-stack-manually/

    • What is RoCE?

    RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to dramatically accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory to memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by hardware resulting in dramatically lower latency, higher throughput, and better performance compared to software based protocols.
    What are the differences between RoCE v1 and RoCE v2?
    As originally implemented and standardized by the InfiniBand Trade Association (IBTA), RoCE was envisioned as a layer 2 protocol, which confines it to a single Ethernet broadcast domain.
    These limitations have driven the demand for RoCE to operate in layer 3 (routable) environments. Fortunately, a straightforward extension of the RoCE framework allows it to be readily transported across layer 3 networks.
    https://community.mellanox.com/docs/DOC-1451

    • Remote Direct Memory Access (RDMA) provides direct memory access from the memory of one host (storage or compute) to the memory of another host without involving the remote Operating System and CPU, boosting network and host performance with lower latency, lower CPU load and higher bandwidth.
    http://www.mellanox.com/page/products_dyn?product_family=79

    • RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.
    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

    • RoCE versus InfiniBand
    RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network.
    RoCE versus iWARP
    While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). RoCE v2 and iWARP packets are routable.
    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

    • RDMA, or Remote Direct Memory Access, communications using Send/Receive semantics and kernel-bypass technologies in server and storage interconnect products permit high-throughput and low-latency networking.

    The set of standards defined by the Data Center Bridging (DCB) task group within IEEE 802.1 is popularly known as Converged Enhanced Ethernet (CEE). The primary target of CEE is the convergence of Inter-Process Communication (IPC), networking and storage traffic in the data center. The lossless CEE functionality is conceptually similar to the features offered by the InfiniBand data link layer and includes IEEE 802.1Qbb Priority Flow Control (PFC), which standardizes a link-level flow control that recognizes 8 traffic classes per port (analogous to InfiniBand virtual lanes).

    The new CEE standards include: 802.1Qbb – Priority-based Flow Control, 802.1Qau – End-to-End Congestion Notification, and 802.1Qaz – Enhanced Transmission Selection and Data Center Bridge Exchange.

    The lossless delivery features of CEE make it a natural choice for building RDMA, Send/Receive and kernel-bypass services over Ethernet; that is, to apply RDMA transport services over CEE – in short, RoCE.

    RoCE is compliant with the OFA verbs definition and is interoperable with the OFA software stack (similar to InfiniBand and iWARP).

    Many Linux distributions that include OFED support a wide and rich range of middleware and application solutions such as IPC, sockets, messaging, virtualization, SAN, NAS, file systems and databases, which enable RoCE to deliver all three dimensions of unified networking on Ethernet – IPC, NAS and SAN.

    RoCE focuses on short-range server-to-server and server-to-storage networks, delivering the lowest latency and jitter characteristics and enabling simpler software and hardware implementations.

    RoCE supports the OFA verbs interface seamlessly. The OFA verbs used by RoCE are based on IB and have been proven in large-scale deployments and with multiple ISV applications, both in the HPC and EDC sectors.
    http://www.itc23.com/fileadmin/ITC23_files/papers/DC-CaVES-PID2019735.pdf

    • Network-intensive applications like networked storage or cluster computing need a network infrastructure with a high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.[5] The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol.[6] There exist RoCE HCAs (Host Channel Adapters) with a latency as low as 1.3 microseconds,[7][8] while the lowest known iWARP HCA latency in 2011 was 3 microseconds.

    The RoCE v1 protocol is an Ethernet link layer protocol with ethertype 0x8915. This means that the frame length limits of the Ethernet protocol apply: 1500 bytes for a regular Ethernet frame and 9000 bytes for a jumbo frame.

    The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol. The UDP destination port number 4791 has been reserved for RoCE v2. Since RoCE v2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE[11] or RRoCE.[3] Although in general the delivery order of UDP packets is not guaranteed, the RoCE v2 specification requires that packets with the same UDP source port and the same destination address must not be reordered.

    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
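    The two framing facts above (EtherType 0x8915 for RoCE v1, reserved UDP destination port 4791 for RoCE v2) are easy to capture in code. A minimal sketch, not a full packet parser:

    #include <stdint.h>
    #include <stdio.h>

    #define ETHERTYPE_ROCE_V1  0x8915   /* RoCE v1: dedicated Ethernet EtherType */
    #define UDP_PORT_ROCE_V2   4791     /* RoCE v2: reserved UDP destination port */

    static int is_roce_v1(uint16_t ethertype)    { return ethertype == ETHERTYPE_ROCE_V1; }
    static int is_roce_v2(uint16_t udp_dst_port) { return udp_dst_port == UDP_PORT_ROCE_V2; }

    int main(void)
    {
        printf("EtherType 0x8915 -> RoCE v1? %d\n", is_roce_v1(0x8915));
        printf("UDP dport 4791   -> RoCE v2? %d\n", is_roce_v2(4791));
        return 0;
    }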

    • RoCE versus InfiniBand
    RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network.
    RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.[15] Others expected that InfiniBand would keep offering a higher bandwidth and lower latency than what is possible over Ethernet.
    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet



    • RoCE is a standard protocol defined in the InfiniBand Trade Association (IBTA) standard. RoCE makes use of UDP encapsulation allowing it to transcend Layer 3 networks. RDMA is a key capability natively used by the InfiniBand interconnect technology. Both InfiniBand and Ethernet RoCE share a common user API but have different physical and link layers.
    http://www.mellanox.com/page/products_dyn?product_family=79


    • Why RoCE?
    RDMA over Converged Ethernet (RoCE) allows all the advantages of RDMA, but on existing Ethernet
    networks. With RoCE there is no need to convert a data center from Ethernet to InfiniBand, saving
    companies massive amounts of capital expenditures. There is no difference to an application between
    using RDMA over InfiniBand or over Ethernet, so application writers who are more comfortable in an
    Ethernet environment are well-covered by RoCE.
    http://www.mellanox.com/related-docs/whitepapers/roce_in_the_data_center.pdf





    • Remote Direct Memory Access (RDMA) provides direct access from the memory of one computer to
    the memory of another without involving either computer’s operating system. This technology enables
    high-throughput, low-latency networking with low CPU utilization, which is especially useful in massively
    parallel compute clusters
    http://www.mellanox.com/related-docs/whitepapers/WP_RoCE_vs_iWARP.pdf
    • There are two different RDMA performance testing packages included with Red Hat Enterprise Linux, qperf and perftest. Either of these may be used to further test the performance of an RDMA network.
    https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-testing_an_rdma_network_after_ipoib_is_configured

    • InfiniBand is a great networking protocol that has many features and provides great performance. However, unlike RoCE and iWARP, which run over an Ethernet infrastructure (NICs, switches and routers) and support legacy (IP-based) applications by design, InfiniBand is a completely different protocol: it uses a different addressing mechanism, which isn't IP, and doesn't support sockets – therefore it doesn't support legacy applications. This means that in some clusters InfiniBand can't be used as the only interconnect, since many management systems are IP-based. The result could be that clusters that use InfiniBand for the data traffic deploy an Ethernet infrastructure in the cluster as well (for the management). This increases the price and complexity of cluster deployment.


    IP over InfiniBand (IPoIB) architecture

    The IPoIB module registers itself with the local Operating System's network stack as an Ethernet device and translates all the needed functionality between Ethernet and InfiniBand.
    Such an IPoIB network interface is created for every port of the InfiniBand device. In Linux, the prefix of those interfaces is "ib".

    The traffic that is sent over the IPoIB network interface uses the network stack of the kernel and doesn't benefit from features of the InfiniBand device:
    kernel bypass
    reliability
    zero copy
    splitting and assembly of messages to packets, and more.

    IPoIB Limitations
    IPoIB solves many problems for us. However, compared to a standard Ethernet network interface, it has some limitations:

        IPoIB supports IP-based application only (since the Ethernet header isn't encapsulated).
        SM/SA must always be available in order for IPoIB to function.
        The MAC address of an IPoIB network interface is 20 bytes.
        The MAC address of the network interface can't be controlled by the user.
        The MAC address of the IPoIB network interface may change in consecutive loading of the IPoIB module and it isn't persistent (i.e. a constant attribute of the interface).
        Configuring VLANs in an IPoIB network interface requires awareness of the SM to the corresponding P_Keys.
        Non-standard interface for managing VLANs.

    http://www.rdmamojo.com/2015/02/16/ip-infiniband-ipoib-architecture/


    • Competing networks
    General purpose network (e.g. Ethernet)
    High-performance network with RDMA (e.g. InfiniBand)
    http://www.cs.cmu.edu/~wtantisi/files/nfs-pdl08-poster.pdf
    • In our test environment we used two 240GB Micron M500DC SSDs in RAID0 in each of our two nodes. We connected the two peers using Infiniband ConnectX-4 10Gbe. We then ran a series of tests to compare the performance of DRBD disconnected (not-replicating), DRBD connected using TCP over Infiniband, and DRBD connected using RDMA over Infiniband, all against the performance of the backing disks without DRBD.

    For testing random read/write IOPs we used fio with 4K blocksize and 16 parallel jobs. For testing sequential writes we used dd with 4M blocks. Both tests used the appropriate flag for
    direct IO in order to remove any caching that might skew the results.

    https://www.linbit.com/en/drbd-9-over-rdma-with-micron-ssds/
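    The "appropriate flag for direct IO" on Linux is O_DIRECT, which bypasses the page cache so caching cannot skew results. A minimal sketch of a direct 4K write loop (the /data/testfile path is hypothetical; the actual benchmarks above were run with fio and dd):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t block = 4096;              /* 4K block size, as in the fio test */
        void *buf = NULL;

        /* O_DIRECT requires buffers aligned to the logical block size. */
        if (posix_memalign(&buf, block, block) != 0) { perror("posix_memalign"); return 1; }
        memset(buf, 0xAB, block);

        int fd = open("/data/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 1024; i++)          /* 1024 x 4 KiB = 4 MiB written */
            if (write(fd, buf, block) != (ssize_t)block) { perror("write"); break; }

        close(fd);
        free(buf);
        return 0;
    }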

    • Ethernet and Infiniband are common network connections used in the supercomputing world. Infiniband was introduced in 2000 to tie memory and the processes of multiple servers together, so communication speeds would be as fast as if they were on the same PCB.
    http://journals.sfu.ca/jsst/index.php/jsst/article/viewFile/22/pdf_16

    • There is value in Infiniband as a storage protocol to replace FibreChannel, with several SSD vendors offering Infiniband options. Most likely this is necessary to allow servers to get enough network speed, but mostly to reduce latency and CPU consumption. Good Infiniband networks have latency measured in hundreds of nanoseconds and much lower impact on system CPU, because Infiniband uses RDMA to transfer data.
    http://etherealmind.com/infiniband-ethernet-rocee-better-than-dcb/
    • Soft RoCE is a software implementation that allows RoCE to run on any Ethernet network adapter whether it offers hardware acceleration or not. It leverages the same efficiency characteristics as RoCE, providing a complete RDMA stack implementation over any NIC. Soft RoCE driver implements the InfiniBand RDMA transport over the Linux network stack. It enables a system with standard Ethernet adapter to inter-operate with hardware RoCE adapter or with another system running Soft-RoCE.
    http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v4_3.pdf

    • Starting the RDMA services
    1. If one is using the InfiniBand transport and doesn't have a managed switch in the subnet, the Subnet Manager (SM) has to be started. Running it on one of the machines in the subnet is enough; this can be done with the following command line:
    [root@localhost]# /etc/init.d/opensmd start

    2. RDMA needs to work with pinned memory, i.e. memory which cannot be swapped out by the kernel. By default, every process that is running as a non-root user is allowed to pin a low amount of memory (64KB). In order to work properly as a non-root user, it is highly recommended to increase the size of memory which can be locked. Edit the file /etc/security/limits.conf and add the following lines, if they weren't added by the installation script:
    * soft memlock unlimited
    * hard memlock unlimited
    http://www.rdmamojo.com/2014/11/22/working-rdma-using-mellanox-ofed/
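    A process can verify the effective memlock (pinned memory) limit that these limits.conf entries are meant to raise; RDMA memory registration by non-root users fails when this limit is too low. A minimal sketch using the standard getrlimit() call:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) { perror("getrlimit"); return 1; }

        if (rl.rlim_cur == RLIM_INFINITY)
            printf("memlock limit: unlimited\n");
        else
            printf("memlock limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
        return 0;
    }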

    • Native Infiniband / RoCE / Omni-Path Support (RDMA)
    RDMA support for Infiniband, RoCE (RDMA over Converged Ethernet), and Omni-Path in BeeGFS is based on the Open Fabrics Enterprise Distribution ibverbs API (http://www.openfabrics.org).
    https://www.beegfs.io/wiki/NativeInfinibandSupport

    • RDMA is Remote Direct Memory Access, which is a way of moving buffers between two applications across a network. RDMA differs from traditional network interfaces because it bypasses the operating system. This allows programs that implement RDMA to have:



        The absolute lowest latency
        The highest throughput
        Smallest CPU footprint

    To make use of RDMA we need to have a network interface card that implements an RDMA engine.
    We call this an HCA (Host Channel Adapter). The adapter creates a channel from its RDMA engine through the PCI Express bus to the application memory.
    In RDMA we set up data channels using a kernel driver. We call this the command channel.
    We use the command channel to establish data channels which will allow us to move data while bypassing the kernel entirely.
    Once we have established these data channels, we can read and write buffers directly.
    These data channels are established through an API called “verbs”.

    The verbs API is maintained in an open source Linux project called the Open Fabrics Enterprise Distribution (OFED) (www.openfabrics.org).
    There is an equivalent project for Windows, WinOF.

    Queue Pairs
    RDMA operations start by “pinning” memory. When you pin memory you are telling the kernel that this memory is owned by the application. Now we tell the HCA to address the memory and prepare a channel from the card to the memory. We refer to this as registering a Memory Region. We can now use the memory that has been registered in any of the RDMA operations we want to perform
    RDMA is an asynchronous transport mechanism.

    https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/
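    A minimal sketch of the flow described above using the OFED verbs API: open the first RDMA device, allocate a protection domain, and register (pin) a memory region so the HCA can access it directly. Error handling is abbreviated; link with -libverbs:

    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

        size_t len = 4096;
        void *buf = malloc(len);

        /* Registering ("pinning") the buffer creates a Memory Region the HCA can use. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { perror("ibv_reg_mr"); return 1; }

        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }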

    • The following configurations need to be modified after the installation is complete:

    /etc/rdma/rdma.conf
    /etc/udev/rules.d/70-persistent-ipoib.rules
    /etc/security/limits.d/rdma.conf

    The rdma service reads /etc/rdma/rdma.conf to find out which kernel-level and user-level RDMA protocols the administrator wants to be loaded by default. You should edit this file to turn various drivers on or off. The rdma package provides the file /etc/udev/rules.d/70-persistent-ipoib.rules. This udev rules file is used to rename IPoIB devices from their default names (such as ib0 and ib1) to more descriptive names. You should edit this file to change how your devices are named.
    RDMA communications require that physical memory in the computer be pinned (meaning that the kernel is not allowed to swap that memory out to a paging file in the event that the overall computer starts running short on available memory). Pinning memory is normally a very privileged operation. In order to allow users other than root to run large RDMA applications, it will likely be necessary to increase the amount of memory that non-root users are allowed to pin in the system. This is done by adding an rdma.conf file in the /etc/security/limits.d/ directory with the desired memlock limits (as in the limits.conf example above).

    Because RDMA applications are so different from Berkeley Sockets-based applications and from normal IP networking, most applications that are used on an IP network cannot be used
    directly on an RDMA network.
    https://lenovopress.com/lp0823.pdf

    • Dramatically improves the performance of socket-based applications
    Mellanox Messaging Accelerator (VMA) boosts the performance of message-based and streaming applications across a wide range of industries. These include the Financial Services market's High-Frequency Trading (HFT) platforms and Web2.0 clusters.
    VMA is an Open Source library project that exposes standard socket APIs with kernel-bypass architecture, enabling user-space networking for Multicast, UDP unicast and TCP streaming. VMA also features ease of use out-of-the-box, with built-in preconfigured profiles such as for latency or streaming.
    http://www.mellanox.com/page/software_vma?mtag=vma
    • Using RDMA has the following major advantages:

        Zero-copy - applications can perform data transfer without the network software stack involvement; data is sent and received directly to and from the application buffers without being copied between the network layers.
        Kernel bypass - applications can perform data transfer directly from userspace without the need to perform context switches.
        No CPU involvement - applications can access remote memory without consuming any CPU on the remote machine. The remote machine's memory will be read without any intervention of a remote process (or processor). The caches in the remote CPU(s) won't be filled with the accessed memory content.
        Message-based transactions - the data is handled as discrete messages and not as a stream, which eliminates the need for the application to separate the stream into different messages/transactions.
        Scatter/gather entries support - RDMA natively supports working with multiple scatter/gather entries, i.e. reading multiple memory buffers and sending them as one stream, or receiving one stream and writing it to multiple memory buffers (a code sketch follows after this list).

    You can find RDMA in industries that need at least one of the following:

        Low latency - For example HPC, financial services, web 2.0
        High Bandwidth - For example HPC, medical appliances, storage, and backup systems, cloud computing
        Small CPU footprint  - For example HPC, cloud computing

    There are several network protocols which support RDMA:


        InfiniBand (IB) - a new generation network protocol which supports RDMA natively from the beginning. Since this is a new network technology, it requires NICs and switches which support this technology.
        RDMA Over Converged Ethernet (RoCE) - a network protocol which allows performing RDMA over an Ethernet network. Its lower network headers are Ethernet headers and its upper network headers (including the data) are InfiniBand headers. This allows using RDMA over standard Ethernet infrastructure (switches). Only the NICs need to be special and support RoCE.
        Internet Wide Area RDMA Protocol (iWARP) - a network protocol which allows performing RDMA over TCP. There are features that exist in IB and RoCE that aren't supported in iWARP. This also allows using RDMA over standard Ethernet infrastructure (switches). Only the NICs need to be special and support iWARP (if CPU offloads are used); otherwise, the iWARP stack can be implemented in software, losing most of the RDMA performance advantages.



    http://www.rdmamojo.com/2014/03/31/remote-direct-memory-access-rdma/
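    As a follow-up to the scatter/gather point in the list above: a single verbs work request can carry several registered buffers as separate ibv_sge entries, and the HCA sends them as one message. A minimal sketch, assuming an already-connected queue pair qp and two registered memory regions mr1 and mr2 (hypothetical names):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_two_buffer_send(struct ibv_qp *qp, struct ibv_mr *mr1, struct ibv_mr *mr2)
    {
        struct ibv_sge sge[2];
        struct ibv_send_wr wr, *bad_wr = NULL;

        /* Two scatter/gather entries, sent by the HCA as one message. */
        sge[0].addr   = (uintptr_t)mr1->addr;
        sge[0].length = (uint32_t)mr1->length;
        sge[0].lkey   = mr1->lkey;
        sge[1].addr   = (uintptr_t)mr2->addr;
        sge[1].length = (uint32_t)mr2->length;
        sge[1].lkey   = mr2->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = sge;
        wr.num_sge    = 2;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;

        return ibv_post_send(qp, &wr, &bad_wr);   /* returns 0 on success */
    }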

    • kernel bypass



      • NVMe over Fabrics (NVMe-oF)
        NVMe basics

    • OpenFabrics Alliance
      The OpenFabrics Enterprise Distribution (OFED™)/OpenFabrics Software is open-source software for RDMA and kernel bypass applications. OFS is used in business, research and scientific environments that require highly efficient networks, storage connectivity, and parallel computing.
      OFS includes kernel-level drivers, channel-oriented RDMA and send/receive operations, kernel bypasses of the operating system, both kernel and user-level application programming interface (API) and services for parallel message passing (MPI), sockets data exchange (e.g., RDS, SDP), NAS and SAN storage (e.g. iSER, NFS-RDMA, SRP) and file system/database systems.
      The network and fabric technologies that provide RDMA performance with OFS include legacy 10 Gigabit Ethernet, iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and 10/20/40 Gigabit InfiniBand.
      OFS is used in high-performance computing for breakthrough applications that require high-efficiency computing, wire-speed messaging, microsecond latencies and fast I/O for storage and file systems.

      OFS delivers valuable benefits to end-user organizations, including high CPU efficiency, reduced energy consumption, and reduced rack-space requirements. OFS offers these benefits on commodity servers for academic, engineering, enterprise, research and cloud applications. OFS also provides investment protection as parallel computing and storage evolve toward exascale computing, and as networking speeds move toward 10 Gigabit Ethernet and 40 Gigabit InfiniBand in the enterprise data center.
      https://www.openfabrics.org/index.php/openfabrics-software.html

    • Facebook Introduces A Flexible NVMe JBOF
      At the Open Compute Project Summit today Facebook is announcing its flexible NVMe JBOF (just a bunch of flash), Lightning.
      Facebook is contributing this new JBOF to the Open Compute Project.
      Benefits include:

          Lightning can support a variety of SSD form factors, including 2.5", M.2, and 3.5" SSDs.
          Lightning will support surprise hot-add and surprise hot-removal of SSDs to make field replacements as simple and transparent as SAS JBODs.
          Lightning will use an ASPEED AST2400 BMC chip running OpenBMC.
          Lightning will support multiple switch configurations, which allows us to support different SSD configurations (for example, 15x x4 SSDs vs. 30x x2 SSDs) or different head node to SSD mappings without changing HW in any way.
          Lightning will be capable of supporting up to four head nodes. By supporting multiple head nodes per tray, we can adjust the compute-to-storage ratio as needed simply by changing the switch configuration.

      http://www.storagereview.com/facebook_introduces_a_flexible_nvme_jbof

    • The HPC Project is focused on developing a fully open heterogeneous computing, networking and fabric platform, optimized for multi-node processors and agnostic to any computing group, using x86, ARM, OpenPOWER, FPGA, ASIC, DSP, and GPU silicon on a hardware platform.
      http://www.opencompute.org/projects/high-performance-computing-hpc/
      • NVMe over Fabrics (NVMe-oF)
      Benefits of NVMe over Fabrics for disaggregation
      https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150811_FA11_Burstein.pdf
    • Energy efficiency for green HPC
    This is the reason why Eurotech decided to concentrate on cooling and power conversion. Our systems are entirely liquid cooled with water up to 50 °C, making the use of air conditioning redundant. They also use a power conversion technology with an efficiency of 97%. This means a Eurotech-based data center can have a PUE of 1.05!
    https://www.eurotech.com/en/hpc/hpcomputing/energy+efficient+hpc


    • PUE for a dedicated building is the total facility energy divided by the IT equipment energy. PUE is an end-user metric used to help improve energy efficiency in data center operations. For example, a PUE of 1.05 means the facility consumes only 5% more energy than the IT equipment itself.
    https://eehpcwg.llnl.gov/documents/infra/09_pue_past-tue_future.pdf

    • green HPC




    • The OpenFabrics Enterprise Distribution (OFED™)/OpenFabrics Software is open-source software for RDMA and kernel bypass applications. OFS is used in business, research and scientific environments that require highly efficient networks, storage connectivity, and parallel computing. The software provides high-performance computing sites and enterprise data centers with flexibility and investment protection as computing evolves towards applications that require extreme speeds, massive scalability, and utility-class reliability.
    OFS includes kernel-level drivers, channel-oriented RDMA and send/receive operations, kernel bypasses of the operating system, both kernel and user-level application programming interface (API) and services for parallel message passing (MPI), sockets data exchange (e.g., RDS, SDP), NAS and SAN storage (e.g. iSER, NFS-RDMA, SRP) and file system/database systems.
    The network and fabric technologies that provide RDMA performance with OFS include legacy 10 Gigabit Ethernet, iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and 10/20/40 Gigabit InfiniBand.
    OFS is available for many Linux and Windows distributions, including Red Hat Enterprise Linux (RHEL), Novell SUSE Linux Enterprise Distribution (SLES), Oracle Enterprise Linux (OEL) and Microsoft Windows Server operating systems. Some of these distributions ship OFS in-box.
    OFS for High-Performance Computing.
    OFS is used in high-performance computing for breakthrough applications that require high-efficiency computing, wire-speed messaging, microsecond latencies and fast I/O for storage and file systems
    OFS for Enterprise Data Centers
    OFS delivers valuable benefits to end-user organizations, including high CPU efficiency, reduced energy consumption, and reduced rack-space requirements.
    https://www.openfabrics.org/index.php/openfabrics-software.html

    • EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High-Performance Computing (HPC) systems in an efficient way.
    https://github.com/easybuilders/easybuild
    • Using the Open Build Service (OBS) to manage build process
    http://hpcsyspros.lsu.edu/2016/slides/OpenHPC-hycsyspro-overview.pdf

    • Welcome to the OpenHPC site. OpenHPC is a collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage High-Performance Computing (HPC) Linux clusters including provisioning tools, resource management, I/O clients, development tools, and a variety of scientific libraries. Packages provided by OpenHPC have been pre-built with HPC integration in mind with a goal to provide re-usable building blocks for the HPC community
    https://openhpc.community/

    • This stack provides a variety of common, pre-built ingredients required to deploy and manage an HPC Linux cluster including provisioning tools, resource management, I/O clients, runtimes, development tools, and a variety of scientific libraries.
    https://github.com/openhpc/ohpc

    • OpenHPC containers
    • OpenHPC software stack
    • OpenHPC v1.3 software components


    • OpenHPC (v1.3.4)
    Cluster Building Recipes

    CentOS7.4 Base OS
    Warewulf/PBS Professional Edition for Linux* (x86_64)

    This guide presents a simple cluster installation procedure using components from the OpenHPC software
    stack. OpenHPC represents an aggregation of a number of common ingredients required to deploy and manage an HPC Linux* cluster including

    provisioning tools,
    resource management,
    I/O clients,
    development tools,
    and a variety of scientific libraries

    These packages have been pre-built with HPC integration in mind while conforming to common Linux distribution standards.



    This guide is targeted at experienced Linux system administrators for HPC environments. Knowledge of
    software package management, system networking, and PXE booting is assumed.
    Unless specified otherwise, the examples presented are executed with elevated (root) privileges.
    The examples also presume the use of the BASH login shell

    This installation recipe assumes the availability of a single head node master, and four compute nodes.
    4-node stateless cluster installation
    The master node serves as the overall system management server (SMS) and is provisioned with CentOS7.4 and is
    subsequently configured to provision the remaining compute nodes with Warewulf in a stateless configuration

    The terms master and SMS are used interchangeably in this guide. For power management, we assume that
    the compute node baseboard management controllers (BMCs) are available via IPMI from the chosen master
    host.
    For file systems, we assume that the chosen master server will host an NFS file system that is made
    available to the compute nodes.
    Installation information is also discussed to optionally mount a parallel
    file system and in this case, the parallel file system is assumed to exist previously.

    The master host requires at least two Ethernet interfaces with eth0 connected to
    the local data center network and eth1 used to provision and manage the cluster backend
    Two logical IP interfaces are expected to each compute node: the first is the standard Ethernet interface that
    will be used for provisioning and resource management. The second is used to connect to each host’s BMC
    and is used for power management and remote console access.
    Physical connectivity for these two logical
    IP networks is often accommodated via separate cabling and switching infrastructure; however, an alternate
    configuration can also be accommodated via the use of a shared NIC, which runs a packet filter to divert
    management packets between the host and BMC
    In addition to the IP networking, there is an optional high-speed network (InfiniBand or Omni-Path
    in this recipe) that is also connected to each of the hosts. This high-speed network is used for application
    message passing and optionally for parallel file system connectivity as well (e.g. to existing Lustre or BeeGFS
    storage targets)

    In an external setting, installing the desired BOS on a master SMS host typically involves booting from a
    DVD ISO image on a new server. With this approach, insert the CentOS7.4 DVD, power cycle the host, and
    follow the distro provided directions to install the BOS on your chosen master host

    •  http://openhpc.community/downloads/


    • xCAT offers complete management for HPC clusters, RenderFarms, Grids, WebFarms, Online Gaming Infrastructure, Clouds, Datacenters etc. xCAT provides an extensible framework.
    https://xcat.org/

    • PBS Professional software optimizes job scheduling and workload management in high-performance computing (HPC) environments
    https://www.pbspro.org/

    • Warewulf is a scalable systems management suite originally developed to manage large high-performance Linux clusters.
    http://warewulf.lbl.gov/

    • Hence the emergence of a tangible need for supercomputers, able to perform the large number of required checks in an extremely short time.
    In addition, modern methods for the identification of cyber crimes increasingly involve techniques featuring cross-analysis of data coming from several different sources – and these techniques further increase the computing capacity required.
    Today's security solutions require a mix of software and hardware to boost the power of security algorithms, analyze enormous data quantities in real time, rapidly encrypt and decrypt data, identify abnormal patterns, check identities, simulate attacks, validate software security proofs, patrol systems, analyze video material and perform many more additional actions.
    The Aurora on-node FPGAs allow secure encryption/decryption to coexist with the fastest processors on the market. The Aurora Dura Mater rugged HPC, an outdoor portable highly robust system, can be deployed closer to where the data is captured to allow a number of on-field operations, for instance the processing of video surveillance data, bypassing network bottlenecks.
    https://www.eurotech.com/en/hpc/industry+solutions/cyber+security


    • This is particularly true with respect to operational cybersecurity that, at best, is seen as a necessary evil, and considered as generally restrictive of performance and/or functionality. As a representative high-performance open-computing site, NERSC has decided to place as few roadblocks as possible for access to site computational and networking resources.
    https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Monday/03B-Mellander-Paper.pdf

    • Activate tools such as SELinux that will control the access through predefined roles, but most HPC centers do not activate it as it breaks down the normal operation of the Linux clusters and it becomes harder to debug when applications don’t work as expected
    https://idre.ucla.edu/sites/default/files/cybersecurity_hpc.pdf?x83242

    • The traffic among HPC systems connected through public or private networks is now exclusively carried over encrypted protocols using OpenSSL, such as ssh, sftp, https, etc.
    Since HPC resources run some version of the Linux operating system, they all invariably run an iptables-based firewall at the host level, which is the primary tool to restrict access to service ports from the outside network.
    https://idre.ucla.edu/sites/default/files/cybersecurity_hpc.pdf?x83242

    • Some HPC centers do not rely on users to protect their passwords, so they implement what is called a One-Time Password (OTP), where users are given small calculator-like devices to generate a random key to log in to the system. However, that is inconvenient for users, as they have to carry this device with them all the time, and it also adds to the operating cost of HPC systems.
    http://www.hpctoday.com/viewpoints/cyber-security-in-high-performance-computing-environment/2/


    • Protecting a High-Performance Computing (HPC) cluster against real-world cyber threats is a critical task nowadays, with the increasing trend to open and share computing resources.
    http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5633784


    • Security of cluster access usually relies on Discretionary Access Control (DAC) and sandboxing such as chroot, BSD Jail [1] or Vserver [2]. The DAC model has proven to be fragile [3]. Moreover, virtual containers do not protect against privilege escalation, where users can get administration privileges in order to compromise data confidentiality and integrity.
    http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5633784

    • GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. The GridFTP protocol is based on FTP, the highly-popular Internet file transfer protocol. 

    http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/


    • What is GridFTP?
      A protocol for efficient file transfer
      An extension of FTP
      Currently the de facto standard for moving large files across the Internet

    Using GridFTP: some background
      You move files between “endpoints” identified by a URI
      One of these endpoints is often your local machine, but GridFTP can also be used to move data between two remote endpoints
      X.509 (“grid”) certificates are used for authentication to a particular endpoint – usually you'll have one certificate that you use at both endpoints, but this need not be the case

    Features of GridFTP
      Security with GSI, the Grid Security Infrastructure
      Third-party transfers
      Parallel and striped transfer
      Partial file transfer
      Fault tolerance and restart
      Automatic TCP optimisation
    https://www.eudat.eu/sites/default/files/GridFTP-Intro-Rome.pdf
    • This paper describes FLEXBUS, a flexible, high-performance on-chip communication architecture featuring a dynamically configurable topology. FLEXBUS is designed to detect run-time variations in communication traffic characteristics, and efficiently adapt the topology of the communication architecture, both at the system-level, through dynamic bridge by-pass, as well as at the component-level, using component re-mapping.

    https://www10.edacafe.com/nbc/articles/1/212767/FLEXBUS-High-Performance-System-on-Chip-Communication-Architecture-with-Dynamically-Configurable-Topology-Technical-Paper

    • High Performance Computing (HPC) Fabric Builders

    Intel® Omni-Path Fabric (Intel® OP Fabric) has a robust architecture and feature set to meet the current and future demands of high performance computing (HPC) at a cost-competitive price to today's fabric. Fabric builders members are working to enable world-class solutions based on the Intel® Omni-Path Architecture (Intel® OPA) with a goal of helping end users build the best environment to meet their HPC needs.
    https://www.intel.com/content/www/us/en/products/network-io/high-performance-fabrics.html

    • High-Performance Computing Solutions

    In-memory high performance computing
    Build an HPE and Intel® OPA foundation
    Solve supercomputing challenges
    https://www.hpe.com/us/en/solutions/hpc-high-performance-computing.html
    Memory-Driven Computing: the ultimate composable infrastructure
    • RoCE vs. iWARP Q&A


    There are two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP.

    Q. Does RDMA use the TCP/UDP/IP protocol stack?
    A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don’t use Ethernet.

    Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?
    A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (like VxLAN offloads, checksum offloads etc.) with RDMA functionality.

    Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP?
    A. RDMAP/DDP/MPA are stacked on top of TCP, so these protocols sit above Layer 4, the Transport Layer, in the OSI model.

    Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB etc.), on top of RoCE or iWARP?
    A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this from happening a best practice would be to use explicit congestion notification (ECN), or better yet, data center bridging (DCB) to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, the network administrators can ensure bandwidth is available for the flows that require it the most.
    iWARP provides RDMA functionality over TCP/IP and inherits the loss resilience and congestion management from the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP including not requiring any specific host or switch configuration as well as out-of-the-box support across LAN/MAN/WAN networks.

    Q. Is iWARP a lossless or lossy protocol?
    A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offload engine (TOE) functionality.
    http://sniaesfblog.org/roce-vs-iwarp-qa/


    • A Brief about RDMA, RoCE v1 vs RoCE v2 vs iWARP


    For those who are wondering what these words mean: this is a post about networking, and an overview of how to increase your network speed without adding new servers, or of running IB over your network. IB stands for InfiniBand, a networking communication standard used mainly in HPC that features very high throughput and very low latency.

    DCB
    Data-Center Bridging (DCB) is an extension to the Ethernet protocol that makes dedicated traffic flows possible in a converged network scenario. DCB distinguishes traffic flows by tagging the traffic with a specific value (0-7) called a “CoS” value which stands for Class of Service. CoS values can also be referred to as “Priority” or “Tag”.
    PFC
    In standard Ethernet, we have the ability to pause traffic when the receive buffers are getting full. The downside of Ethernet pause is that it will pause all traffic on the link. As the name gives away, Priority-based Flow Control (PFC) can pause the traffic per flow based on that specific priority; in other words, PFC creates pause frames based on a traffic CoS value.
    ETS
    With DCB in place, the traffic flows are nicely separated from each other and can pause independently because of PFC, but PFC does not provide any Quality of Service (QoS). If your servers are able to fully utilize the full network pipe with only storage traffic, other traffic such as cluster heartbeat or tenant traffic may be put in jeopardy.
    The purpose of Enhanced Transmission Selection (ETS) is to allocate bandwidth based on the different priority settings of the traffic flows. This way the network components share the same physical pipe, but ETS makes sure that everyone gets the specified share of the pipe, preventing the “noisy neighbor” effect.
    Data Center Bridging Exchange Protocol
    This protocol is better known as DCBX and is also an extension of DCB, where the “X” stands for eXchange.
    DCBX can be used to share information about the DCB settings between peers (switches and servers) to ensure you have a consistent configuration on your network.
    http://innovativebit.blogspot.com/2018/06/a-brief-about-rdma-roce-v1-vs-roce-v2.html
    File Storage Types and Protocols for Beginners
    DAS SCSI
    FC SAN
    iSCSI
    NAS
    • SMB Multichannel and SMB Direct

    SMB Multichannel is the feature responsible for detecting the RDMA capabilities of network adapters to enable SMB Direct. Without SMB Multichannel, SMB uses regular TCP/IP with the RDMA-capable network adapters (all network adapters provide a TCP/IP stack along with the new RDMA stack).
    https://docs.microsoft.com/en-us/windows-server/storage/file-server/smb-direct

    • What is SMB Direct?

    SMB Direct is SMB 3 traffic over RDMA. So, what is RDMA? DMA stands for Direct Memory Access. This means that an application has access (read/write) to a host's memory directly, without CPU intervention. If you do this between hosts, it becomes Remote Direct Memory Access (RDMA).
    https://www.starwindsoftware.com/blog/smb-direct-the-state-of-rdma-for-use-with-smb-3-traffic-part-i

    High-end GPFS HPC box gets turbo-charged
    DDN's GRIDScaler gets bigger scale-up and scale-out muscles
    • MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard

    https://www.mpich.org/
    Catalyst UK Program successfully drives development of Arm-based HPC systems
    Best Practice Guide – Parallel I/O
    Can (HPC)Clouds supersede traditional High Performance Computing?
    • The MVAPICH2 software, based on MPI 3.1 standard, delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE networking technologies.

    http://mvapich.cse.ohio-state.edu/
    Building HPC Cloud with InfiniBand: Efficient Support in MVAPICH2, Xiaoyi Lu

    MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
    Keynote: Designing HPC, BigD & DeepL MWare for Exascale - DK Panda, The Ohio State University

    OpenStack and HPC Network Fabrics
    HPC Networking in OpenStack: Part 1
    OpenStack and HPC Workload Management
    A Self-Adaptive Network for HPC Clouds: Architecture, Framework, and Implementation
    OpenStack and Virtualised HPC

    Making Containers Easier with HPC Container Maker
    SUSE Linux Enterprise High Perfor­mance Compu­ting
    OSC's Owens cluster being installed in 2016 is a Dell-built, Intel® Xeon® processor-based supercomputer.
    MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand
    Introduction to Linux & HPC
    One Library with Multiple Fabric Support
    • Intel® MPI Library is a multifabric message-passing library that implements the open-source MPICH specification. Use the library to create, maintain, and test advanced, complex applications that perform better on HPC clusters based on Intel® processors.


    https://software.intel.com/en-us/mpi-library
    Supermicro High Performance Computing (HPC) Solutions
    PDC Center for High Performance Computing