Friday, February 9, 2018

HPC

  • The Open Science Grid (OSG) provides common service and support for resource providers and scientific institutions (i.e., "sites") using a distributed fabric of high throughput computational services. This documentation aims to provide HTC/HPC system administrators with the information needed to contribute resources to the OSG. The main prerequisite is an existing compute cluster running a supported operating system with a supported batch system: Grid Engine, HTCondor, LSF, PBS Pro/Torque, or Slurm.
https://opensciencegrid.org/docs/

    • WekaIO Matrix and Univa Grid Engine – Extending your On-Premise HPC Cluster to the Cloud
    The capacity of the file system will depend on the number of cluster hosts, and the size and number of SSDs on each host. For example, if I need 100TB of shared usable storage, I can accomplish this by using a cluster of 12 x i3.16xlarge storage dense instances on AWS, where each instance has 8 x 1,900GB NVMe SSDs for a total of 15.2 TB of SSD storage per instance. A twelve-node cluster, in theory, would have 182.4 TB of capacity, but if we deploy a stripe size of eight (6N + 2P), approximately 25% of our capacity will be dedicated to storing parity. A further 10% of the Matrix file system is recommended to be held in reserve for internal housekeeping and caching, so the usable capacity of the twelve-node cluster is ~123 TB (assuming no backing store).
    https://blogs.univa.com/2019/06/wekaio-matrix-and-univa-grid-engine-extending-your-on-premise-hpc-cluster-to-the-cloud/
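
    A quick back-of-the-envelope check of those numbers (the 6/8 data fraction and the 10% reserve are taken straight from the quote above):

        raw SSD capacity:    12 hosts x 8 x 1,900 GB = 12 x 15.2 TB = 182.4 TB
        after 6N+2P parity:  182.4 TB x 6/8          = 136.8 TB
        after ~10% reserve:  136.8 TB x 0.9          ≈ 123 TB usable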

    • What Is High-Performance Computing?
    High-performance computing (HPC) is the ability to process data and perform complex calculations at high speeds. 
    One of the best-known types of HPC solutions is the supercomputer. A supercomputer contains thousands of compute nodes that work together to complete one or more tasks. This is called parallel processing. It’s similar to having thousands of PCs networked together, combining compute power to complete tasks faster.
    https://www.netapp.com/us/info/what-is-high-performance-computing.aspx
    • Open Source High-Performance Computing
      The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is, therefore, able to combine the expertise, technologies, and resources from all across the High-Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.
      https://www.open-mpi.org/

    • Introduction to high performance computing, what is it, how to use it and when


    Simply a multi-user, shared and smart batch processing system
    Improves the scale & size of processing significantly
    With raw power & parallelization
    Thanks to rapid advances in low cost micro-processors, high-speed networks and optimized software

    https://www.slideshare.net/raamana/high-performance-computing-with-checklist-and-tips-optimal-cluster-usage
    • Some workloads have very high I/O throughput, so to make sure these requirements are met, the kernel uses schedulers.
    Schedulers do exactly what they say: schedule activities within the kernel so that system activities and resources are scheduled to achieve an overall goal for the system. This goal could be low latency for input (as embedded systems require), better interactivity, faster I/O, or even a combination of goals. Primarily, schedulers are concerned with CPU resources, but they could also consider other system resources (e.g., memory, input devices, networks, etc.).
    http://www.admin-magazine.com/HPC/Articles/Linux-I-O-Schedulers?utm_source=ADMIN+Newsletter&utm_campaign=HPC_Update_110_2018-03-22_Linux_I%2FO_Schedulers&utm_medium=email
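
    As a quick illustration (a minimal sketch; /dev/sda is a placeholder and the available scheduler names depend on your kernel and hardware), the active I/O scheduler can be inspected and changed per block device through sysfs:

        # list the available schedulers; the active one is shown in brackets
        cat /sys/block/sda/queue/scheduler
        # switch the scheduler for this device (e.g. to mq-deadline on a multi-queue kernel)
        echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler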
    • The general goal of HPC is either to run applications faster or to run problems that can’t or won’t run on a single server. To do this, you need to run parallel applications across separate nodes. If you are interested in parallel computing using multiple nodes, you need at least two separate systems (nodes), each with its own operating system (OS). To keep things running smoothly, the OS on both nodes should be identical. (Strictly speaking, it doesn’t have to be this way, but otherwise, it is very difficult to run and maintain.) If you install a package on node 1, then it needs to be installed on node 2 as well. This lessens a source of possible problems when you have to debug the system.
    http://www.admin-magazine.com/HPC/Articles/Building-an-HPC-Cluster


    • OFED (OpenFabrics Enterprise Distribution) is a package developed and released by the OpenFabrics Alliance (OFA), as a joint effort of many companies that are part of the RDMA scene. It contains the latest upstream software packages (both kernel modules and userspace code) to work with RDMA. It supports the InfiniBand, Ethernet and RoCE transports.
    http://www.rdmamojo.com/2014/11/30/working-rdma-using-ofed/
    • I explained how to install the RDMA stack in several ways (inbox, OFED and manually).
    http://www.rdmamojo.com/2015/01/24/verify-rdma-working/
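
    A few quick checks that the RDMA stack is alive (a sketch; the device name mlx5_0 and the peer hostname server1 are placeholders, and the ibv_* utilities ship with libibverbs):

        ibv_devices                              # list RDMA devices seen by the verbs layer
        ibv_devinfo -d mlx5_0                    # show port state, link layer and GUIDs for one device
        # simple two-host smoke test using the libibverbs example program:
        ibv_rc_pingpong -d mlx5_0 -g 0           # on the server
        ibv_rc_pingpong -d mlx5_0 -g 0 server1   # on the client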

    • Installation of RDMA stack manually
    It is possible to install the RDMA packages in a convenient and simple way using the OFED distribution or the packages shipped with the Linux distributions.
    http://www.rdmamojo.com/2014/12/27/installation-rdma-stack-manually/
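
    For the distribution-packages route, a minimal sketch on a RHEL/CentOS 7 style system (the "InfiniBand Support" yum group and the rdma service are the distro's names; adjust for your distribution):

        yum -y groupinstall "InfiniBand Support"   # kernel modules, libibverbs, librdmacm and utilities
        yum -y install perftest qperf              # optional benchmarking tools
        systemctl enable --now rdma                # load the configured RDMA modules at boot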

    • What is RoCE?

    RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to dramatically accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory to memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by hardware resulting in dramatically lower latency, higher throughput, and better performance compared to software based protocols.
    What are the differences between RoCE v1 and RoCE v2?
    As originally implemented and standardized by the InfiniBand Trade Association (IBTA), RoCE was envisioned as a layer 2 protocol, which confines RoCE traffic to a single Ethernet broadcast domain.
    These limitations have driven the demand for RoCE to operate in layer 3 (routable) environments. Fortunately, a straightforward extension of the RoCE framework allows it to be readily transported across layer 3 networks.
    https://community.mellanox.com/docs/DOC-1451

    • Remote Direct Memory Access (RDMA) provides direct memory access from the memory of one host (storage or compute) to the memory of another host without involving the remote Operating System and CPU, boosting network and host performance with lower latency, lower CPU load and higher bandwidth.
    http://www.mellanox.com/page/products_dyn?product_family=79

    • RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.
    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

    • RoCE versus InfiniBand
    RoCE defines how to perform RDMA over Ethernet while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network
    RoCE versus iWARP
    While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). Both RoCE v2 and iWARP packets are routable.
    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

    • RDMA, or Remote Direct Memory Access, communications using Send/Receive semantics and kernel-bypass technologies in server and storage interconnect products permit high-throughput and low-latency networking.

    The set of standards defined by the Data Center Bridging (DCB) task group within IEEE 802.1 is popularly known as Converged Enhanced Ethernet (CEE). The primary target of CEE is the convergence of Inter-Process Communication (IPC), networking and storage traffic in the data center. The lossless CEE functionality is conceptually similar to the features offered by the InfiniBand data link layer and includes IEEE 802.1Qbb Priority Flow Control (PFC), which standardizes a link-level flow control that recognizes 8 traffic classes per port (analogous to InfiniBand virtual lanes).

    The new CEE standards include: 802.1Qbb – Priority-based Flow Control, 802.1Qau – End-to-End Congestion Notification, and 802.1Qaz – Enhanced Transmission Selection and Data Center Bridging Exchange.

    The lossless delivery features of CEE make it a natural choice for building RDMA, Send/Receive and kernel-bypass services over Ethernet; that is, applying RDMA transport services over CEE, or in short, RoCE.

    RoCE is compliant with the OFA verbs definition and is interoperable with the OFA software stack (similar to InfiniBand and iWARP).

    Many Linux distributions which include OFED support a wide and rich range of middleware and application solutions such as IPC, sockets, messaging, virtualization, SAN, NAS, file systems and databases, which enable RoCE to deliver all three dimensions of unified networking on Ethernet – IPC, NAS and SAN.

    RoCE focuses on short-range server-to-server and server-to-storage networks, delivering the lowest latency and jitter characteristics and enabling simpler software and hardware implementations.

    RoCE supports the OFA verbs interface seamlessly. The OFA verbs used by RoCE are based on InfiniBand and have been proven in large-scale deployments and with multiple ISV applications, both in the HPC and EDC sectors.
    http://www.itc23.com/fileadmin/ITC23_files/papers/DC-CaVES-PID2019735.pdf

    • Network-intensive applications like networked storage or cluster computing need a network infrastructure with a high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.[5] The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol.[6] There exist RoCE HCAs (Host Channel Adapter) with a latency as low as 1.3 microseconds[7][8] while the lowest known iWARP HCA latency in 2011 was 3 microseconds

    The RoCE v1 protocol is an Ethernet link layer protocol with ethertype 0x8915. This means that the frame length limits of the Ethernet protocol apply: 1500 bytes for a regular Ethernet frame and 9000 bytes for a jumbo frame.

    The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol. The UDP destination port number 4791 has been reserved for RoCE v2. Since RoCE v2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE[11] or RRoCE.[3] Although in general the delivery order of UDP packets is not guaranteed, the RoCE v2 specification requires that packets with the same UDP source port and the same destination address must not be reordered.

    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

    • RoCE versus InfiniBand
    RoCE defines how to perform RDMA over Ethernet while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network
    RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.[15] Others expected that InfiniBand will keep offering a higher bandwidth and lower latency than what is possible over Ethernet
    https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet



    • RoCE is a standard protocol defined in the InfiniBand Trade Association (IBTA) standard. RoCE makes use of UDP encapsulation allowing it to transcend Layer 3 networks. RDMA is a key capability natively used by the InfiniBand interconnect technology. Both InfiniBand and Ethernet RoCE share a common user API but have different physical and link layers.
    http://www.mellanox.com/page/products_dyn?product_family=79


    • Why RoCE?
    RDMA over Converged Ethernet (RoCE) allows all the advantages of RDMA, but on existing Ethernet
    networks. With RoCE there is no need to convert a data center from Ethernet to InfiniBand, saving
    companies massive amounts of capital expenditures. There is no difference to an application between
    using RDMA over InfiniBand or over Ethernet, so application writers who are more comfortable in an
    Ethernet environment are well-covered by RoCE.
    http://www.mellanox.com/related-docs/whitepapers/roce_in_the_data_center.pdf





    • Remote Direct Memory Access (RDMA) provides direct access from the memory of one computer to
    the memory of another without involving either computer’s operating system. This technology enables
    high-throughput, low-latency networking with low CPU utilization, which is especially useful in massively
    parallel compute clusters
    http://www.mellanox.com/related-docs/whitepapers/WP_RoCE_vs_iWARP.pdf
    • There are two different RDMA performance testing packages included with Red Hat Enterprise Linux, qperf and perftest. Either of these may be used to further test the performance of an RDMA network.
    https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-testing_an_rdma_network_after_ipoib_is_configured
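
    A hedged example of how those two packages are typically invoked (hostnames are placeholders; the qperf test names and the perftest ib_* binaries are as shipped in RHEL):

        # qperf: start "qperf" with no arguments on the server, then from the client:
        qperf server1 tcp_bw tcp_lat rc_rdma_write_bw rc_rdma_read_lat
        # perftest: run the same binary on both ends, server side first:
        ib_write_bw                 # on the server
        ib_write_bw server1         # on the client, reports RDMA write bandwidth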

    • InfiniBand is a great networking protocol with many features and great performance. However, unlike RoCE and iWARP, which run over an Ethernet infrastructure (NICs, switches and routers) and support legacy (IP-based) applications by design, InfiniBand is a completely different protocol: it uses a different addressing mechanism, which isn't IP, and it doesn't support sockets, so it doesn't support legacy applications. This means that in some clusters InfiniBand can't be used as the only interconnect, since many management systems are IP-based. The result is that clusters that use InfiniBand for data traffic may deploy an Ethernet infrastructure in the cluster as well (for management), which increases the price and complexity of cluster deployment.


    IP over InfiniBand (IPoIB) architecture

    The IPoIB module registers with the local operating system's network stack as an Ethernet device and translates all the needed functionality between Ethernet and InfiniBand.
    Such an IPoIB network interface is created for every port of the InfiniBand device. In Linux, the prefix of those interfaces is "ib".

    The traffic that is sent over the IPoIB network interface uses the network stack of the kernel and doesn't benefit from features of the InfiniBand device:
    kernel bypass
    reliability
    zero copy
    splitting and assembly of messages to packets, and more.

    IPoIB Limitations
    IPoIB solves many problems for us. However, compared to a standard Ethernet network interface, it has some limitations:

        IPoIB supports IP-based applications only (since the Ethernet header isn't encapsulated).
        SM/SA must always be available in order for IPoIB to function.
        The MAC address of an IPoIB network interface is 20 bytes.
        The MAC address of the network interface can't be controlled by the user.
        The MAC address of the IPoIB network interface may change in consecutive loading of the IPoIB module and it isn't persistent (i.e. a constant attribute of the interface).
        Configuring VLANs in an IPoIB network interface requires awareness of the SM to the corresponding P_Keys.
        Non-standard interface for managing VLANs.

    http://www.rdmamojo.com/2015/02/16/ip-infiniband-ipoib-architecture/
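
    For reference, the "ib" prefix and the 20-byte link-layer address mentioned above can be seen directly with iproute2 (ib0 is just the usual default name):

        ip link show ib0        # link/infiniband with a 20-byte hardware address
        ip -br addr show ib0    # IP configuration carried over the IPoIB interface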

    • RDMA support for Infiniband, RoCE (RDMA over Converged Ethernet), and Omni-Path in BeeGFS is based on the Open Fabrics Enterprise Distribution ibverbs API (http://www.openfabrics.org).
    https://www.beegfs.io/wiki/NativeInfinibandSupport

    • Competing networks
    General purpose network (e.g. Ethernet)
    High-performance network with RDMA (e.g. InfiniBand)
    http://www.cs.cmu.edu/~wtantisi/files/nfs-pdl08-poster.pdf
    • In our test environment we used two 240GB Micron M500DC SSDs in RAID0 in each of our two nodes. We connected the two peers using Infiniband ConnectX-4 10Gbe. We then ran a series of tests to compare the performance of DRBD disconnected (not-replicating), DRBD connected using TCP over Infiniband, and DRBD connected using RDMA over Infiniband, all against the performance of the backing disks without DRBD.

    For testing random read/write IOPs we used fio with 4K blocksize and 16 parallel jobs. For testing sequential writes we used dd with 4M blocks. Both tests used the appropriate flag for
    direct IO in order to remove any caching that might skew the results.

    https://www.linbit.com/en/drbd-9-over-rdma-with-micron-ssds/
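
    The commands below are a sketch of that methodology, not LINBIT's exact invocation: 4K random I/O with 16 fio jobs, and 4M sequential writes with dd, both using direct I/O (the /dev/drbd0 device path and the sizes are placeholders):

        fio --name=randrw --filename=/dev/drbd0 --rw=randrw --bs=4k \
            --ioengine=libaio --iodepth=32 --numjobs=16 --direct=1 \
            --runtime=60 --time_based --group_reporting
        dd if=/dev/zero of=/dev/drbd0 bs=4M count=2048 oflag=direct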

    • Ethernet and Infiniband are common network connections  used  in  the  supercomputing  world. Infiniband  was  introduced  in  2000  to  tie  memory  and the  processes  of  multiple  servers  together,  so  communication speeds would be as fast as if they were on the same PCB
    http://journals.sfu.ca/jsst/index.php/jsst/article/viewFile/22/pdf_16

    • The value of Infiniband as a storage protocol to replace FibreChannel is reflected in several SSD vendors offering Infiniband options. Most likely this is necessary to allow servers to get enough network speed, but mostly to reduce latency and CPU consumption. Good Infiniband networks have latency measured in hundreds of nanoseconds and a much lower impact on system CPU because Infiniband uses RDMA to transfer data.
    http://etherealmind.com/infiniband-ethernet-rocee-better-than-dcb/
    • Soft RoCE is a software implementation that allows RoCE to run on any Ethernet network adapter whether it offers hardware acceleration or not. It leverages the same efficiency characteristics as RoCE, providing a complete RDMA stack implementation over any NIC. Soft RoCE driver implements the InfiniBand RDMA transport over the Linux network stack. It enables a system with standard Ethernet adapter to inter-operate with hardware RoCE adapter or with another system running Soft-RoCE.
    http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v4_3.pdf
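
    On a recent Linux kernel, Soft-RoCE can be tried without special hardware roughly as follows (a sketch; eth0 is a placeholder, and older stacks use the rxe_cfg helper instead of the iproute2 rdma tool):

        modprobe rdma_rxe                          # load the software RoCE driver
        rdma link add rxe0 type rxe netdev eth0    # bind an rxe device to an Ethernet NIC
        ibv_devices                                # rxe0 should now appear as an RDMA device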

    • Starting the RDMA services
    If you are using the InfiniBand transport and don't have a managed switch in the subnet, you have to start a Subnet Manager (SM). Doing this on one of the machines in the subnet is enough; this can be done with the following command line:
    [root@localhost]# /etc/init.d/opensmd start

    2. RDMA needs to work with pinned memory, i.e. memory which cannot be swapped out by the kernel. By default, every process that is running as a non-root user is allowed to pin a low amount of memory (64KB). In order to work properly as a non-root user, it is highly recommended to increase the size of memory which can be locked. Edit the file /etc/security/limits.conf and add the following lines, if they weren't added by the installation script:
    * soft memlock unlimited
    * hard memlock unlimited
    http://www.rdmamojo.com/2014/11/22/working-rdma-using-mellanox-ofed/


    RDMA, or Remote Direct Memory Access, is a way of moving buffers between two applications across a network. RDMA differs from traditional network interfaces because it bypasses the operating system. This allows programs that implement RDMA to have:



        The absolute lowest latency
        The highest throughput
        Smallest CPU footprint

    To make use of RDMA we need a network interface card that implements an RDMA engine.
    We call this an HCA (Host Channel Adapter). The adapter creates a channel from its RDMA engine through the PCI Express bus to the application memory.
    In RDMA we set up data channels using a kernel driver. We call this the command channel.
    We use the command channel to establish data channels which allow us to move data while bypassing the kernel entirely.
    Once we have established these data channels we can read and write buffers directly.
    The API used to establish these data channels is called "verbs".

    The verbs API is maintained in an open-source Linux project called the OpenFabrics Enterprise Distribution (OFED) (www.openfabrics.org).
    There is an equivalent project for Windows, WinOF.

    Queue Pairs
    RDMA operations start by “pinning” memory. When you pin memory you are telling the kernel that this memory is owned by the application. Now we tell the HCA to address the memory and prepare a channel from the card to the memory. We refer to this as registering a Memory Region. We can now use the memory that has been registered in any of the RDMA operations we want to perform
    RDMA is an asynchronous transport mechanism.

    https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/

    • The following configurations need to be modified after the installation is complete:

    /etc/rdma/rdma.conf
    /etc/udev/rules.d/70-persistent-ipoib.rules
    /etc/security/limits.d/rdma.conf

    The rdma service reads /etc/rdma/rdma.conf to find out which kernel-level and user-level RDMA protocols the administrator wants loaded by default. You should edit this file to
    turn various drivers on or off. The rdma package provides the file /etc/udev/rules.d/70-persistent-ipoib.rules. This udev rules file is used to rename IPoIB devices from
    their default names (such as ib0 and ib1) to more descriptive names. You should edit this file to change how your devices are named.
    RDMA communications require that physical memory in the computer be pinned (meaning that the kernel is not allowed to swap that memory out to a paging file in the event that the
    overall computer starts running short on available memory). Pinning memory is normally a very privileged operation. In order to allow users other than root to run large RDMA
    applications, it will likely be necessary to increase the amount of memory that non-root users are allowed to pin in the system. This is done by adding a file rdma.conf in the
    /etc/security/limits.d/ directory with the memlock limits shown earlier (soft and hard memlock set to unlimited, or to a suitably large value).

    Because RDMA applications are so different from Berkeley Sockets-based applications and from normal IP networking, most applications that are used on an IP network cannot be used
    directly on an RDMA network.
    https://lenovopress.com/lp0823.pdf
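
    For illustration, a 70-persistent-ipoib.rules entry follows the pattern below (the GUID-based address and the target name are placeholders; ATTR{type}=="32" matches InfiniBand-type interfaces):

        ACTION=="add", SUBSYSTEM=="net", DRIVERS=="?*", ATTR{type}=="32", ATTR{address}=="?*00:02:c9:03:00:31:78:f2", NAME="mlx4_ib0"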

    • Dramatically improves the performance of socket-based applications
    Mellanox Messaging Accelerator (VMA) boosts the performance of message-based and streaming applications across a wide range of industries. These include the Financial Service market's High-Frequency Trading (HFT) platforms, Web2.0 clusters
    VMA is an Open Source library project that exposes standard socket APIs with kernel-bypass architecture, enabling user-space networking for Multicast, UDP unicast and TCP streaming. VMA also features ease of use out-of-the-box, with built-in preconfigured profiles such as for latency or streaming.
    http://www.mellanox.com/page/software_vma?mtag=vma
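
    Usage is typically just preloading the library under the unmodified application (a sketch; ./my_udp_app is a hypothetical binary, and VMA_SPEC=latency selects the built-in latency profile mentioned above):

        VMA_SPEC=latency LD_PRELOAD=libvma.so ./my_udp_app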
    • Using RDMA has the following major advantages:

        Zero-copy - applications can perform data transfers without involving the network software stack; data is sent and received directly to/from the buffers without being copied between the network layers.
        Kernel bypass - applications can perform data transfers directly from userspace without the need to perform context switches.
        No CPU involvement - applications can access remote memory without consuming any CPU on the remote machine. The remote machine's memory is read without any intervention of a remote process (or processor), and the caches of the remote CPU(s) won't be filled with the accessed memory content.
        Message-based transactions - the data is handled as discrete messages and not as a stream, which eliminates the need for the application to separate the stream into different messages/transactions.
        Scatter/gather entries support - RDMA natively supports working with multiple scatter/gather entries, i.e. reading multiple memory buffers and sending them as one stream, or receiving one stream and writing it to multiple memory buffers.

    You can find RDMA in industries that need at least one of the following:

        Low latency - For example HPC, financial services, web 2.0
        High Bandwidth - For example HPC, medical appliances, storage, and backup systems, cloud computing
        Small CPU footprint  - For example HPC, cloud computing

    there are several network protocols which support RDMA:


        InfiniBand (IB) - a new generation network protocol which supports RDMA natively from the beginning. Since this is a new network technology, it requires NICs and switches which support it.
        RDMA over Converged Ethernet (RoCE) - a network protocol which allows performing RDMA over an Ethernet network. Its lower network headers are Ethernet headers and its upper network headers (including the data) are InfiniBand headers. This allows using RDMA over standard Ethernet infrastructure (switches); only the NICs need to be special and support RoCE.
        Internet Wide Area RDMA Protocol (iWARP) - a network protocol which allows performing RDMA over TCP. There are features that exist in IB and RoCE that aren't supported in iWARP. This also allows using RDMA over standard Ethernet infrastructure (switches); only the NICs need to be special and support iWARP (if CPU offloads are used), otherwise the whole iWARP stack can be implemented in software, losing most of the RDMA performance advantages.



    http://www.rdmamojo.com/2014/03/31/remote-direct-memory-access-rdma/

    • kernel bypass



      • nvme over fabrics (NVMe-OF)
        NVMe basics

    • open fabric alliance
      The OpenFabrics Enterprise Distribution (OFED™)/OpenFabrics Software is open-source software for RDMA and kernel bypass applications. OFS is used in business, research and scientific environments that require highly efficient networks, storage connectivity, and parallel computing.
      OFS includes kernel-level drivers, channel-oriented RDMA and send/receive operations, kernel bypasses of the operating system, both kernel and user-level application programming interface (API) and services for parallel message passing (MPI), sockets data exchange (e.g., RDS, SDP), NAS and SAN storage (e.g. iSER, NFS-RDMA, SRP) and file system/database systems.
      The network and fabric technologies that provide RDMA performance with OFS include legacy 10 Gigabit Ethernet, iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and 10/20/40 Gigabit InfiniBand.
      OFS is used in high-performance computing for breakthrough applications that require high-efficiency computing, wire-speed messaging, microsecond latencies and fast I/O for storage and file systems.

      OFS delivers valuable benefits to end-user organizations, including high CPU efficiency, reduced energy consumption, and reduced rack-space requirements. OFS offers these benefits on commodity servers for academic, engineering, enterprise, research and cloud applications. OFS also provides investment protection as parallel computing and storage evolve toward exascale computing, and as networking speeds move toward 10 Gigabit Ethernet and 40 Gigabit InfiniBand in the enterprise data center.
      https://www.openfabrics.org/index.php/openfabrics-software.html

    • Facebook Introduces A Flexible NVMe JBOF
      At the Open Compute Project Summit today Facebook is announcing its flexible NVMe JBOF (just a bunch of flash), Lightning.
      Facebook is contributing this new JBOF to the Open Compute Project.
      Benefits include:

          Lightning can support a variety of SSD form factors, including 2.5", M.2, and 3.5" SSDs.
          Lightning will support surprise hot-add and surprise hot-removal of SSDs to make field replacements as simple and transparent as SAS JBODs.
          Lightning will use an ASPEED AST2400 BMC chip running OpenBMC.
          Lightning will support multiple switch configurations, which allows us to support different SSD configurations (for example, 15x x4 SSDs vs. 30x x2 SSDs) or different head node to SSD mappings without changing HW in any way.
          Lightning will be capable of supporting up to four head nodes. By supporting multiple head nodes per tray, we can adjust the compute-to-storage ratio as needed simply by changing the switch configuration.

      http://www.storagereview.com/facebook_introduces_a_flexible_nvme_jbof

    • The HPC Project is focused on developing a fully open heterogeneous computing, networking and fabric platform, optimized for a multi-node processor that is agnostic to any computing group, using x86, ARM, OpenPOWER, FPGA, ASIC, DSP, and GPU silicon on a hardware platform.
      http://www.opencompute.org/projects/high-performance-computing-hpc/
      • NVMe Over Fabrics-NVMe-OF
      Benefits of NVMe over Fabrics for disaggregation
      https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150811_FA11_Burstein.pdf
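
    With an RDMA fabric in place, attaching a remote NVMe subsystem from a Linux initiator looks roughly like this (a sketch using nvme-cli; the address, port and NQN are placeholders):

        nvme discover -t rdma -a 192.168.1.10 -s 4420
        nvme connect  -t rdma -a 192.168.1.10 -s 4420 -n nqn.2018-02.io.example:nvme-target
        nvme list     # the remote namespace shows up as a local /dev/nvmeXnY device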
    • Energy efficiency for green HPC
    This is the reason why Eurotech decided to concentrate on cooling and power conversion. Our systems are entirely liquid cooled with water up to 50 °C, making the use of air conditioning redundant. They also use a power conversion technology with an efficiency of 97%. This means a Eurotech-based data center can have a PUE of 1.05!
    https://www.eurotech.com/en/hpc/hpcomputing/energy+efficient+hpc


    • PUE for a dedicated building is the total facility energy divided by the IT equipment energy. PUE is an end-user metric used to help improve energy efficiency in data center operations.
    https://eehpcwg.llnl.gov/documents/infra/09_pue_past-tue_future.pdf
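
    A quick worked example: with PUE = total facility energy / IT equipment energy, a facility drawing 1,050 kW to power 1,000 kW of IT equipment has PUE = 1,050 / 1,000 = 1.05, i.e. only 5% overhead for cooling, power conversion and other infrastructure (the figure Eurotech quotes above).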

    • green HPC




    • The OpenFabrics Enterprise Distribution (OFED™)/OpenFabrics Software is open-source software for RDMA and kernel bypass applications. OFS is used in business, research and scientific environments that require highly efficient networks, storage connectivity, and parallel computing. The software provides high-performance computing sites and enterprise data centers with flexibility and investment protection as computing evolves towards applications that require extreme speeds, massive scalability, and utility-class reliability.
    OFS includes kernel-level drivers, channel-oriented RDMA and send/receive operations, kernel bypasses of the operating system, both kernel and user-level application programming interface (API) and services for parallel message passing (MPI), sockets data exchange (e.g., RDS, SDP), NAS and SAN storage (e.g. iSER, NFS-RDMA, SRP) and file system/database systems.
    The network and fabric technologies that provide RDMA performance with OFS include legacy 10 Gigabit Ethernet, iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and 10/20/40 Gigabit InfiniBand.
    OFS is available for many Linux and Windows distributions, including Red Hat Enterprise Linux (RHEL), Novell SUSE Linux Enterprise Distribution (SLES), Oracle Enterprise Linux (OEL) and Microsoft Windows Server operating systems. Some of these distributions ship OFS in-box.
    OFS for High-Performance Computing.
    OFS is used in high-performance computing for breakthrough applications that require high-efficiency computing, wire-speed messaging, microsecond latencies and fast I/O for storage and file systems
    OFS for Enterprise Data Centers
    OFS delivers valuable benefits to end-user organizations, including high CPU efficiency, reduced energy consumption, and reduced rack-space requirements.
    https://www.openfabrics.org/index.php/openfabrics-software.html

    • EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High-Performance Computing (HPC) systems in an efficient way.
    https://github.com/easybuilders/easybuild
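
    Typical EasyBuild usage looks roughly like this (a sketch; the HPL easyconfig name is illustrative, and the exact easyconfigs and toolchains available depend on your EasyBuild version):

        pip install easybuild              # or load EasyBuild via your site's module system
        eb --search HPL                    # find available easyconfigs
        eb HPL-2.3-foss-2021b.eb --robot   # build HPL plus all missing dependencies
        module avail HPL                   # installed software is exposed as environment modules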
    • Using the Open Build Service (OBS) to manage build process
    http://hpcsyspros.lsu.edu/2016/slides/OpenHPC-hycsyspro-overview.pdf

    • Welcome to the OpenHPC site. OpenHPC is a collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage High-Performance Computing (HPC) Linux clusters including provisioning tools, resource management, I/O clients, development tools, and a variety of scientific libraries. Packages provided by OpenHPC have been pre-built with HPC integration in mind with a goal to provide re-usable building blocks for the HPC community
    https://openhpc.community/

    • This stack provides a variety of common, pre-built ingredients required to deploy and manage an HPC Linux cluster including provisioning tools, resource management, I/O clients, runtimes, development tools, and a variety of scientific libraries.
    https://github.com/openhpc/ohpc

    • OpenHPC containers
    • OpenHPC software stack
    • OpenHPC v1.3 software components


    • OpenHPC (v1.3.4)
    Cluster Building Recipes

    CentOS 7.4 Base OS
    Warewulf/PBS Professional Edition for Linux* (x86_64)

    This guide presents a simple cluster installation procedure using components from the OpenHPC software
    stack. OpenHPC represents an aggregation of a number of common ingredients required to deploy and manage an HPC Linux* cluster including

    provisioning tools,
    resource management,
    I/O clients,
    development tools,
    and a variety of scientific libraries

    These packages have been pre-built with HPC integration in mind while conforming to common Linux distribution standards.



    This guide is targeted at experienced Linux system administrators for HPC environments. Knowledge of
    software package management, system networking, and PXE booting is assumed.
    Unless specified otherwise, the examples presented are executed with elevated (root) privileges.
    The examples also presume the use of the BASH login shell

    This installation recipe assumes the availability of a single head node master, and four compute nodes.
    4-node stateless cluster installation
    The master node serves as the overall system management server (SMS); it is provisioned with CentOS 7.4 and
    subsequently configured to provision the remaining compute nodes with Warewulf in a stateless configuration.

    The terms master and SMS are used interchangeably in this guide. For power management, we assume that
    the compute node baseboard management controllers (BMCs) are available via IPMI from the chosen master
    host.
    For file systems, we assume that the chosen master server will host an NFS file system that is made
    available to the compute nodes.
    Installation information is also provided for optionally mounting a parallel
    file system; in this case, the parallel file system is assumed to already exist.

    The master host requires at least two Ethernet interfaces with eth0 connected to
    the local data center network and eth1 used to provision and manage the cluster backend
    Two logical IP interfaces are expected for each compute node: the first is the standard Ethernet interface that
    will be used for provisioning and resource management. The second is used to connect to each host’s BMC
    and is used for power management and remote console access. Physical connectivity for these two logical
    IP networks is often accommodated via separate cabling and switching infrastructure; however, an alternate
    configuration can also be accommodated via the use of a shared NIC, which runs a packet filter to divert
    management packets between the host and BMC.
    In addition to the IP networking, there is an optional high-speed network (InfiniBand or Omni-Path
    in this recipe) that is also connected to each of the hosts. This high-speed network is used for application
    message passing and optionally for parallel file system connectivity as well (e.g. to existing Lustre or BeeGFS
    storage targets)

    In an external setting, installing the desired BOS on a master SMS host typically involves booting from a
    DVD ISO image on a new server. With this approach, insert the CentOS 7.4 DVD, power cycle the host, and
    follow the distro-provided directions to install the BOS on your chosen master host.
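
    Once the base OS is installed, the first steps of the recipe boil down to enabling the OpenHPC repository and pulling in the provisioning and resource-manager meta-packages (a sketch following the 1.3 Warewulf/PBS Pro recipe; see the guide itself for the exact ohpc-release repo RPM and the full node-provisioning steps):

        yum -y install ohpc-base             # common OpenHPC base packages on the SMS
        yum -y install ohpc-warewulf         # Warewulf provisioning stack
        yum -y install pbspro-server-ohpc    # PBS Professional server for workload management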

    •  http://openhpc.community/downloads/


    • xCAT offers complete management for HPC clusters, render farms, grids, web farms, online gaming infrastructure, clouds, datacenters, etc. xCAT provides an extensible framework that is based on years of system administration best practices and experience.

    https://xcat.org/
    XCAT CLUSTER MANAGEMENT


    The following diagram shows the basic structure of xCAT

    • PBS Professional software optimizes job scheduling and workload management in high-performance computing (HPC) environments
    https://www.pbspro.org/

    • Warewulf is a scalable systems management suite originally developed to manage large high-performance Linux clusters.
    http://warewulf.lbl.gov/

    • Hence the emergence of a tangible need for supercomputers able to perform the large number of required checks in an extremely short time.
    In addition, modern methods for the identification of cyber crimes increasingly involve techniques featuring cross-analysis of data coming from several different sources – and these techniques further increase the computing capacity required.
    Today's security solutions require a mix of software and hardware to boost the power of security algorithms, analyze enormous data quantities in real time, rapidly encrypt and decrypt data, identify abnormal patterns, check identities, simulate attacks, validate software security proofs, patrol systems, analyze video material and perform many more additional actions.
    The Aurora on-node FPGAs allow secure encryption/decryption to coexist with the fastest processors on the market. The Aurora Dura Mater rugged HPC, an outdoor portable highly robust system, can be deployed closer to where the data is captured to allow a number of in-field operations, for instance the processing of video surveillance data, bypassing network bottlenecks.
    https://www.eurotech.com/en/hpc/industry+solutions/cyber+security


    • This is particularly true with respect to operational cybersecurity that, at best, is seen as a necessary evil, and considered as generally restrictive of performance and/or functionality. As a representative high-performance open-computing site, NERSC has decided to place as few roadblocks as possible for access to site computational and networking resources.
    https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Monday/03B-Mellander-Paper.pdf

    • Tools such as SELinux control access through predefined roles, but most HPC centers do not activate SELinux because it breaks the normal operation of the Linux clusters and makes it harder to debug applications when they don’t work as expected.
    https://idre.ucla.edu/sites/default/files/cybersecurity_hpc.pdf?x83242

    • Traffic among HPC systems connected through public or private networks is now carried exclusively over encrypted protocols based on OpenSSL, such as ssh, sftp, https, etc.
    Since HPC resources run some version of the Linux operating system, they all invariably run an iptables-based firewall at the host level, which is the primary tool to restrict
    access to service ports from the outside network.
    https://idre.ucla.edu/sites/default/files/cybersecurity_hpc.pdf?x83242
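
    A minimal host-level iptables policy in that spirit might look like the sketch below (the 10.10.0.0/16 management subnet is a placeholder; real HPC firewalls are considerably more involved):

        iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
        iptables -A INPUT -i lo -j ACCEPT
        iptables -A INPUT -p tcp --dport 22 -s 10.10.0.0/16 -j ACCEPT   # ssh only from the trusted subnet
        iptables -P INPUT DROP                                          # drop everything else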

    • Some HPC centers do not rely on users to protect their passwords, so they have implemented One Time Passwords (OTP), where users are given small calculator-like devices that generate a random key to log in to the system. However, this is inconvenient for users, as they have to carry the device with them at all times, and it also adds to the operating cost of HPC systems.
    http://www.hpctoday.com/viewpoints/cyber-security-in-high-performance-computing-environment/2/


    • Protecting a High-Performance Computing (HPC) cluster against real-world cyber threats is a critical task nowadays, with the increasing trend to open and share computing resources.
    http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5633784


    • Security of cluster access usually relies on Discretionary Access Control (DAC) and sandboxing such as chroot, BSD Jail [1] or Vserver [2]. The DAC model has proven to be fragile [3]. Moreover, virtual containers do not protect against privilege escalation, where users can gain administration privileges in order to compromise data
    confidentiality and integrity.
    http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5633784

    • GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. The GridFTP protocol is based on FTP, the highly-popular Internet file transfer protocol. 

    http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/


    • What is GridFTP?
        A protocol for efficient file transfer
        An extension of FTP
        Currently the de facto standard for moving large files across the Internet

    Using GridFTP: some background
        You move files between “endpoints” identified by a URI
        One of these endpoints is often your local machine, but GridFTP can also be used to move data between two remote endpoints
        X.509 (“grid”) certificates are used for authentication to a particular endpoint – usually you’ll have one certificate that you use at both endpoints, but this need not be the case

    Features of GridFTP
        Security with GSI, the Grid Security Infrastructure
        Third-party transfers
        Parallel and striped transfer
        Partial file transfer
        Fault tolerance and restart
        Automatic TCP optimisation
    https://www.eudat.eu/sites/default/files/GridFTP-Intro-Rome.pdf
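
    As a usage sketch, a parallel GridFTP transfer with the Globus Toolkit client looks roughly like this (hostnames, paths and tuning values are placeholders):

        globus-url-copy -vb -p 8 -tcp-bs 16M \
            file:///data/results.tar gsiftp://gridftp.example.org/scratch/results.tar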
    • This paper describes FLEXBUS, a flexible, high-performance onchip communication architecture featuring a dynamically configurable topology. FLEXBUS is designed to detect run-time variations in communication traffic characteristics, and efficiently adapt the topology of the communication architecture, both at the system-level, through dynamic bridge by-pass, as well as at the component-level, using component re-mapping.

    https://www10.edacafe.com/nbc/articles/1/212767/FLEXBUS-High-Performance-System-on-Chip-Communication-Architecture-with-Dynamically-Configurable-Topology-Technical-Paper

    • High Performance Computing (HPC) Fabric Builders

    Intel® Omni-Path Fabric (Intel® OP Fabric) has a robust architecture and feature set to meet the current and future demands of high performance computing (HPC) at a cost-competitive price to today's fabric. Fabric builders members are working to enable world-class solutions based on the Intel® Omni-Path Architecture (Intel® OPA) with a goal of helping end users build the best environment to meet their HPC needs.
    https://www.intel.com/content/www/us/en/products/network-io/high-performance-fabrics.html

    • High-Performance Computing Solutions

    In-memory high performance computing
    Build an HPE and Intel® OPA foundation
    Solve supercomputing challenges
    https://www.hpe.com/us/en/solutions/hpc-high-performance-computing.html
    Memory-Driven Computing: the ultimate composable infrastructure
    • RoCE vs. iWARP Q&A


    There are two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP.

    Q. Does RDMA use the TCP/UDP/IP protocol stack?
    A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don’t use Ethernet.

    Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?
    A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (like VxLAN offloads, checksum offloads, etc.) with RDMA functionality.

    Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP?
    A. RDMAP/DDP/MPA are stacked on top of TCP, so these protocols sit on top of Layer 4, the Transport Layer, in the OSI model.

    Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB etc.), on top of RoCE or iWARP?
    A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this from happening a best practice would be to use explicit congestion notification (ECN), or better yet, data center bridging (DCB) to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, the network administrators can ensure bandwidth is available for the flows that require it the most.
    iWARP provides RDMA functionality over TCP/IP and inherits the loss resilience and congestion management from the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP including not requiring any specific host or switch configuration as well as out-of-the-box support across LAN/MAN/WAN networks.

    Q. Is iWARP a lossless or lossy protocol?
    A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offloaded engine (TOE) functionality.
    http://sniaesfblog.org/roce-vs-iwarp-qa/


    • A Brief about RDMA, RoCE v1 vs RoCE v2 vs iWARP


    For those wondering what these terms mean: this is a post about networking, and an overview of how to increase your network speed without adding new servers, or of running IB over your network. IB stands for InfiniBand, a networking communication standard used mainly in HPC that features very high throughput and very low latency.

    DCB
    Data-Center Bridging (DCB) is an extension to the Ethernet protocol that makes dedicated traffic flows possible in a converged network scenario. DCB distinguishes traffic flows by tagging the traffic with a specific value (0-7) called a “CoS” value which stands for Class of Service. CoS values can also be referred to as “Priority” or “Tag”.
    PFC
    In standard Ethernet, we have the ability to pause traffic when the receive buffers are getting full. The downside of ethernet pause is that it will pause all traffic on the link. As the name already gives it away, Priority-based Flow Control (PFC) can pause the traffic per flow based on that specific Priority, in other words; PFC creates Pause Frames based on a traffic CoS value.
    ETS
    With DCB in place the traffic flows are nicely separated from each other and can pause independently because of PFC, but PFC does not provide any Quality-of-Service (QoS). If your servers are able to fully utilize the network pipe with storage traffic alone, other traffic such as cluster heartbeat or tenant traffic may be put in jeopardy.
    The purpose of Enhanced Transmission Selection (ETS) is to allocate bandwidth based on the different priority settings of the traffic flows. This way the network components share the same physical pipe, but ETS makes sure that everyone gets the specified share of the pipe, preventing the “noisy neighbor” effect.
    Data Center Bridging Exchange Protocol
    This protocol is better known as DCBX and is also an extension of the DCB protocol, where the “X” stands for eXchange.
    DCBX can be used to share information about the DCB settings between peers (switches and servers) to ensure you have a consistent configuration on your network
    http://innovativebit.blogspot.com/2018/06/a-brief-about-rdma-roce-v1-vs-roce-v2.html
    File Storage Types and Protocols for Beginners
    DAS SCSI
    FC SAN
    iSCSI
    NAS
    • SMB Multichannel and SMB Direct

    SMB Multichannel is the feature responsible for detecting the RDMA capabilities of network adapters to enable SMB Direct. Without SMB Multichannel, SMB uses regular TCP/IP with the RDMA-capable network adapters (all network adapters provide a TCP/IP stack along with the new RDMA stack).
    https://docs.microsoft.com/en-us/windows-server/storage/file-server/smb-direct

    • What is SMB Direct?

    SMB Direct is SMB 3 traffic over RDMA. So, what is RDMA? DMA stands for Direct Memory Access. This means that an application has direct (read/write) access to a host's memory without involving the CPU. If you do this between hosts it becomes Remote Direct Memory Access (RDMA).
    https://www.starwindsoftware.com/blog/smb-direct-the-state-of-rdma-for-use-with-smb-3-traffic-part-i

    High-end GPFS HPC box gets turbo-charged
    DDN's GRIDScaler gets bigger scale-up and scale-out muscles
    • MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard

    https://www.mpich.org/
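
    Day-to-day use is just the compiler wrapper plus the launcher (hello.c is a hypothetical MPI program; the same pattern applies to Open MPI and MVAPICH2 below with their own wrappers):

        mpicc -o hello hello.c     # compile and link against the MPI library
        mpiexec -n 4 ./hello       # launch 4 ranks (use the batch system's launcher on a cluster)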
    Catalyst UK Program successfully drives development of Arm-based HPC systems
    • The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

    https://www.open-mpi.org/
    Best Practice Guide – Parallel I/O
    Can (HPC)Clouds supersede traditional High Performance Computing?
    • The MVAPICH2 software, based on MPI 3.1 standard, delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE networking technologies.

    http://mvapich.cse.ohio-state.edu/
    Building HPC Cloud with InfiniBand: Efficient Support in MVAPICH2, Xiaoyi Lu

    MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
    Keynote: Designing HPC, Big Data & Deep Learning Middleware for Exascale - DK Panda, The Ohio State University

    OpenStack and HPC Network Fabrics
    HPC Networking in OpenStack: Part 1
    OpenStack and HPC Workload Management
    A Self-Adaptive Network for HPC Clouds: Architecture, Framework, and Implementation
    OpenStack and Virtualised HPC

    Making Containers Easier with HPC Container Maker
    SUSE Linux Enterprise High Performance Computing
    OSC's Owens cluster being installed in 2016 is a Dell-built, Intel® Xeon® processor-based supercomputer.
    MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand
    Introduction to Linux & HPC
    One Library with Multiple Fabric Support
    • Intel® MPI Library is a multifabric message-passing library that implements the open-source MPICH specification. Use the library to create, maintain, and test advanced, complex applications that perform better on HPC clusters based on Intel® processors.


    https://software.intel.com/en-us/mpi-library
    Supermicro High Performance Computing (HPC) Solutions
    PDC Center for High Performance Computing