Facebook is announcing its flexible NVMe JBOF (just a bunch of flash), Lightning.
http://www.storagereview.com/facebook_introduces_a_flexible_nvme_jbof
"NVMe in a shared chassis look like an internal drive – so it's not shared data. It's not a SAN. You cannot run VMFS (VMware file system) on top of that."
With NVMeF there is a remote direct memory access from a host server to a drive in an NVMeF array.
If you have direct host server access to the array's drives across NVMeF then the array is effectively controller-less, and just a bunch of flash drives (JBOF)
"A SAN? That's associated with Fibre Channel and SCSI. I would call it NVMeF. This is a SAN replacement or next-generation of SAN. It may be a bit confusing to call it a SAN."
https://www.theregister.co.uk/2017/06/23/is_an_nvme_over_fabrics_array_a_san/
Is an NVMe over fabrics array a SAN? "NVMe in a shared chassis look like an internal drive – so it's not shared data. It's not a SAN. You cannot run VMFS (VMware file system) on top of that." With NVMeF there is a remote direct memory access from a host server to a drive in an NVMeF array. If you have direct host server access to the array's drives across NVMeF then the array is effectively controller-less, and just a bunch of flash drives (JBOF) "A SAN? That's associated with Fibre Channel and SCSI. I would call it NVMeF. This is a SAN replacement or next-generation of SAN. It may be a bit confusing to call it a SAN."
The Open Science Grid (OSG) provides common service and support for resource providers and scientific institutions (i.e., "sites") using a distributed fabric of high throughput computational services. This documentation aims to provide HTC/HPC system administrators with the necessary information to contribute resources to the OSG. Contributing requires an existing compute cluster running on a supported operating system with a supported batch system: Grid Engine, HTCondor, LSF, PBS Pro/Torque, or Slurm.
https://opensciencegrid.org/docs/
WekaIO Matrix and Univa Grid Engine – Extending your On-Premise HPC Cluster to the Cloud
The capacity of the file system will depend on the number of cluster hosts, and the size and number of SSDs on each host. For example, if I need 100TB of shared usable storage, I can accomplish this by using a cluster of 12 x i3.16xlarge storage dense instances on AWS where each instance has 8 x 1,900GB NVMe SSDs for a total of 15.2 TB of SSD storage per instance. A twelve node cluster, in theory, would have 182.4 TB of capacity, but if we deploy a stripe size of eight (6N + 2P), approximately 25% of our capacity will be dedicated to storing parity. 10% of the Matrix file system is recommended to be held in reserve for internal housekeeping and caching, so the usable capacity of the twelve node cluster is ~123 TB (assuming no backing store).
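A quick back-of-the-envelope check of that sizing (a sketch; the instance count, SSD sizes and stripe layout are taken from the example above):

# WekaIO Matrix capacity estimate for 12 x i3.16xlarge (8 x 1.9 TB NVMe each)
awk 'BEGIN {
  raw    = 12 * 8 * 1.9   # ~182.4 TB raw SSD capacity
  data   = raw * 6 / 8    # 6N+2P stripe: 6 of every 8 chunks hold data (~136.8 TB)
  usable = data * 0.9     # keep ~10% in reserve for housekeeping and caching
  printf "raw=%.1f TB  after-parity=%.1f TB  usable=%.1f TB\n", raw, data, usable
}'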
High-performance computing (HPC) is the ability to process data and perform complex calculations at high speeds.
One of the best-known types of HPC solutions is the supercomputer. A supercomputer contains thousands of compute nodes that work together to complete one or more tasks. This is called parallel processing. It’s similar to having thousands of PCs networked together, combining compute power to complete tasks faster.
Open Source High-Performance Computing
The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is, therefore, able to combine the expertise, technologies, and resources from all across the High-Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.
https://www.open-mpi.org/
Introduction to high performance computing, what is it, how to use it and when
Simply a multi-user, shared and smart batch processing system
Improves the scale & size of processing significantly
With raw power & parallelization
Thanks to rapid advances in low cost micro-processors, high-speed networks and optimized software
Some workloads have very high I/O throughput requirements, and to make sure these requirements are met, the kernel uses I/O schedulers.
Schedulers do exactly what they say: schedule activities within the kernel so that system activities and resources are scheduled to achieve an overall goal for the system. This goal could be low latency for input (as embedded systems require), better interactivity, faster I/O, or even a combination of goals. Primarily, schedulers are concerned with CPU resources, but they could also consider other system resources (e.g., memory, input devices, networks, etc.).
http://www.admin-magazine.com/HPC/Articles/Linux-I-O-Schedulers?utm_source=ADMIN+Newsletter&utm_campaign=HPC_Update_110_2018-03-22_Linux_I%2FO_Schedulers&utm_medium=email
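For reference, the active scheduler for each block device can be inspected and changed at runtime through sysfs (a sketch; the device name sda and the available scheduler names depend on the kernel in use):

# show the available I/O schedulers; the active one is shown in brackets,
# e.g. "noop deadline [cfq]" (or none/mq-deadline/bfq/kyber on blk-mq kernels)
cat /sys/block/sda/queue/scheduler

# switch the scheduler for this device at runtime (as root)
echo deadline > /sys/block/sda/queue/scheduler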
The general goal of HPC is either to run applications faster or to run problems that can't or won't run on a single server. To do this, you need to run parallel applications across separate nodes. If you are interested in parallel computing using multiple nodes, you need at least two separate systems (nodes), each with its own operating system (OS). To keep things running smoothly, the OS on both nodes should be identical. (Strictly speaking, it doesn't have to be this way, but otherwise, it is very difficult to run and maintain.) If you install a package on node 1, then it needs to be installed on node 2 as well. This lessens a source of possible problems when you have to debug the system.
OFED (OpenFabrics Enterprise Distribution) is a package developed and released by the OpenFabrics Alliance (OFA), as a joint effort of many companies that are part of the RDMA scene. It contains the latest upstream software packages (both kernel modules and userspace code) to work with RDMA. It supports the InfiniBand, Ethernet and RoCE transports.
It is possible to install the RDMA packages, in a convenient and simple way, using the OFED distribution or using the packages that shipped within the Linux distributions.
http://www.rdmamojo.com/2014/12/27/installation-rdma-stack-manually/
http://www.rdmamojo.com/2014/11/30/working-rdma-using-ofed/
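A minimal sketch of pulling in the distribution-shipped RDMA packages and verifying that a device is visible (assuming a RHEL/CentOS host; package group and tool names may differ on other distributions):

# install the in-distro RDMA stack plus userspace diagnostics (RHEL/CentOS)
yum -y groupinstall "InfiniBand Support"
yum -y install libibverbs-utils infiniband-diags perftest qperf

# load the stack and check that the RDMA devices and their ports are visible
systemctl start rdma
ibv_devices
ibv_devinfo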
What is RoCE?
RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to dramatically accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory to memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by hardware resulting in dramatically lower latency, higher throughput, and better performance compared to software based protocols.
What are the differences between RoCE v1 and RoCE v2?
As originally implemented and standardized by the InfiniBand Trade Association (IBTA), RoCE was envisioned as a layer 2 protocol, confined to a single Ethernet broadcast domain. These limitations have driven the demand for RoCE to operate in layer 3 (routable) environments. Fortunately, a straightforward extension of the RoCE framework allows it to be readily transported across layer 3 networks.
https://community.mellanox.com/docs/DOC-1451
Remote Direct Memory Access (RDMA) provides direct memory access from the memory of one host (storage or compute) to the memory of another host without involving the remote Operating System and CPU, boosting network and host performance with lower latency, lower CPU load and higher bandwidth.
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.
RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network.
RoCE versus iWARP
While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). RoCE v2 and iWARP packets are routable.
https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
RDMA (Remote Direct Memory Access) communications, using Send/Receive semantics and kernel-bypass technologies in server and storage interconnect products, permit high-throughput and low-latency networking.
The set of standards defined by the Data Center Bridging (DCB) task group within IEEE 802.1 is popularly known as Converged Enhanced Ethernet (CEE). The primary target of CEE is the convergence of Inter Process Communication (IPC), networking and storage traffic in the data center. The lossless CEE functionality is conceptually similar to the features offered by the InfiniBand data link layer and includes: IEEE 802.1Qbb Priority flow control (PFC), which standardizes a link level flow control that recognizes 8 traffic classes per port (analogous to InfiniBand virtual lanes).
The new CEE standards include: 802.1Qbb – Priority-based Flow Control, 802.1Qau – End-to-End Congestion Notification, and 802.1Qaz – Enhanced Transmission Selection and Data Center Bridging Exchange.
The lossless delivery features in CEE make it a natural choice for building RDMA, SEND/RECEIVE and kernel-bypass services over Ethernet; the approach taken is to apply RDMA transport services over CEE, or in short, RoCE.
RoCE is compliant with the OFA verbs definition and is interoperable with the OFA software stack (similar to InfiniBand and iWARP).
Many Linux distributions, which include OFED, support a wide and rich range of middleware and application solutions such as IPC, sockets, messaging, virtualization, SAN, NAS, file systems and databases, which enable RoCE to deliver all three dimensions of unified networking on Ethernet – IPC, NAS and SAN.
RoCE focuses on short range server-to-server and server-to-storage networks, delivering the lowest latency and jitter characteristics and enabling simpler software and hardware implementations.
RoCE supports the OFA verbs interface seamlessly. The OFA verbs used by RoCE are based on IB and have been proven in large scale deployments and with multiple ISV applications, both in the HPC and EDC sectors.
http://www.itc23.com/fileadmin/ITC23_files/papers/DC-CaVES-PID2019735.pdf
Network-intensive applications like networked storage or cluster computing need a network infrastructure with a high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.[5] The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol.[6] There exist RoCE HCAs (Host Channel Adapters) with a latency as low as 1.3 microseconds,[7][8] while the lowest known iWARP HCA latency in 2011 was 3 microseconds.
The RoCE v1 protocol is an Ethernet link layer protocol with ethertype 0x8915. This means that the frame length limits of the Ethernet protocol apply: 1500 bytes for a regular Ethernet frame and 9000 bytes for a jumbo frame.
The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol. The UDP destination port number 4791 has been reserved for RoCE v2. Since RoCE v2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE[11] or RRoCE.[3] Although in general the delivery order of UDP packets is not guaranteed, the RoCE v2 specification requires that packets with the same UDP source port and the same destination address must not be reordered.
RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.[15] Others expected that InfiniBand would keep offering a higher bandwidth and lower latency than what is possible over Ethernet.
https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
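Those protocol numbers give a quick way to check which RoCE version is actually on the wire (a sketch; the interface name is an assumption, and traffic fully offloaded to an RDMA NIC may not appear in a host-side capture):

# RoCE v1: layer 2 frames with ethertype 0x8915
tcpdump -i eth0 ether proto 0x8915

# RoCE v2: UDP datagrams to the reserved destination port 4791
tcpdump -i eth0 udp dst port 4791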
RoCE is a standard protocol defined in the InfiniBand Trade Association (IBTA) standard. RoCE makes use of UDP encapsulation allowing it to transcend Layer 3 networks. RDMA is a key capability natively used by the InfiniBand interconnect technology. Both InfiniBand and Ethernet RoCE share a common user API but have different physical and link layers.
RDMA over Converged Ethernet (RoCE) allows all the advantages of RDMA, but on existing Ethernet
networks. With RoCE there is no need to convert a data center from Ethernet to InfiniBand, saving
companies massive amounts of capital expenditures. There is no difference to an application between
using RDMA over InfiniBand or over Ethernet, so application writers who are more comfortable in an
Ethernet environment are well-covered by RoCE.
http://www.mellanox.com/related-docs/whitepapers/roce_in_the_data_center.pdf
Remote Direct Memory Access (RDMA) provides direct access from the memory of one computer to
the memory of another without involving either computer’s operating system. This technology enables
high-throughput, low-latency networking with low CPU utilization, which is especially useful in massively
parallel compute clusters
http://www.mellanox.com/related-docs/whitepapers/WP_RoCE_vs_iWARP.pdf
There are two different RDMA performance testing packages included with Red Hat Enterprise Linux, qperf and perftest. Either of these may be used to further test the performance of an RDMA network.
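Typical usage of the two packages looks like this (a sketch; the server host name and defaults such as the RDMA device selection are assumptions):

# qperf: start a server on one node, then measure RDMA RC bandwidth/latency from a client
qperf                          # on the server node
qperf server1 rc_bw rc_lat     # on the client node

# perftest: an equivalent send bandwidth test with ib_send_bw
ib_send_bw                     # on the server node
ib_send_bw server1             # on the client node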
InfiniBand is a great networking protocol that has many features and provides great performance. However, unlike RoCE and iWARP, which run over an Ethernet infrastructure (NICs, switches and routers) and support legacy (IP-based) applications by design, InfiniBand is a completely different protocol: it uses a different addressing mechanism, which isn't IP, and doesn't support sockets - therefore it doesn't support legacy applications. This means that in some clusters InfiniBand can't be used as the only interconnect, since many management systems are IP-based. The result could be that clusters which use InfiniBand for the data traffic deploy an Ethernet infrastructure in the cluster as well (for the management). This increases the price and complexity of cluster deployment.
IP over InfiniBand (IPoIB) architecture
The IPoIB module registers with the local operating system's network stack as an Ethernet device and translates all the needed functionality between Ethernet and InfiniBand. Such an IPoIB network interface is created for every port of the InfiniBand device. In Linux, the prefix of those interfaces is "ib".
The traffic that is sent over the IPoIB network interface uses the network stack of the kernel and doesn't benefit from features of the InfiniBand device:
kernel bypass
reliability
zero copy
splitting and assembly of messages to packets, and more.
IPoIB Limitations
IPoIB solves many problems for us. However, compared to a standard Ethernet network interface, it has some limitations:
IPoIB supports IP-based applications only (since the Ethernet header isn't encapsulated).
SM/SA must always be available in order for IPoIB to function.
The MAC address of an IPoIB network interface is 20 bytes. The MAC address of the network interface can't be controlled by the user.
The MAC address of the IPoIB network interface may change in consecutive loading of the IPoIB module and it isn't persistent (i.e. a constant attribute of the interface).
Configuring VLANs on an IPoIB network interface requires the SM to be aware of the corresponding P_Keys.
Non-standard interface for managing VLANs.
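Bringing up an IPoIB interface otherwise works like any other Linux network interface; only the datagram/connected mode knob is IPoIB-specific (a sketch; addresses and interface names are assumptions):

# assign an address to the IPoIB interface created for the first IB port
ip addr add 10.10.0.1/24 dev ib0
ip link set ib0 up

# check (or switch) the IPoIB transport mode: datagram or connected
cat /sys/class/net/ib0/mode
echo connected > /sys/class/net/ib0/mode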
General purpose network (e.g. Ethernet)
High-performance network with RDMA (e.g. InfiniBand)
http://www.cs.cmu.edu/~wtantisi/files/nfs-pdl08-poster.pdf
In our test environment we used two 240GB Micron M500DC SSDs in RAID0 in each of our two nodes. We connected the two peers using Infiniband ConnectX-4 10Gbe. We then ran a series of tests to compare the performance of DRBD disconnected (not-replicating), DRBD connected using TCP over Infiniband, and DRBD connected using RDMA over Infiniband, all against the performance of the backing disks without DRBD.
For testing random read/write IOPs we used fio with 4K blocksize and 16 parallel jobs. For testing sequential writes we used dd with 4M blocks. Both tests used the appropriate flag for
direct IO in order to remove any caching that might skew the results.
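The commands behind that test plan look roughly like the following (a sketch; the DRBD device path, I/O engine and runtime are assumptions, since only block size, job count and direct I/O are stated above):

# random read/write IOPS: 4K blocks, 16 parallel jobs, direct I/O to avoid caching
fio --name=randrw --filename=/dev/drbd0 --rw=randrw --bs=4k \
    --numjobs=16 --direct=1 --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based --group_reporting

# sequential write throughput: 4M blocks, direct I/O
dd if=/dev/zero of=/dev/drbd0 bs=4M count=2048 oflag=direct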
Ethernet and Infiniband are common network connections used in the supercomputing world. Infiniband was introduced in 2000 to tie memory and the processes of multiple servers together, so communication speeds would be as fast as if they were on the same PCB.
Several SSD vendors offer Infiniband options, pointing to the value of Infiniband as a storage protocol to replace Fibre Channel. Most likely this is necessary to allow servers to get enough network speed, but mostly to reduce latency and CPU consumption. Good Infiniband networks have latency measured in hundreds of nanoseconds and a much lower impact on system CPU because Infiniband uses RDMA to transfer data.
Soft RoCE is a software implementation that allows RoCE to run on any Ethernet network adapter whether it offers hardware acceleration or not. It leverages the same efficiency characteristics as RoCE, providing a complete RDMA stack implementation over any NIC. Soft RoCE driver implements the InfiniBand RDMA transport over the Linux network stack. It enables a system with standard Ethernet adapter to inter-operate with hardware RoCE adapter or with another system running Soft-RoCE.
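With a recent kernel and iproute2, a Soft-RoCE (rxe) device can be layered on top of a plain Ethernet NIC roughly like this (a sketch; the interface and device names are assumptions):

# create a software RoCE device on top of a standard Ethernet interface
modprobe rdma_rxe
rdma link add rxe0 type rxe netdev eth0

# the new device should now show up next to any hardware RDMA adapters
rdma link show
ibv_devices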
If one is using the InfiniBand transport and doesn't have a managed switch in the subnet, the Subnet Manager (SM) has to be started. Running it on one of the machines in the subnet is enough; this can be done with the following command line:
[root@localhost]# /etc/init.d/opensmd start
RDMA needs to work with pinned memory, i.e. memory which cannot be swapped out by the kernel. By default, every process that is running as a non-root user is allowed to pin a low amount of memory (64KB). In order to work properly as a non-root user, it is highly recommended to increase the size of memory which can be locked. Edit the file /etc/security/limits.conf and add the following lines, if they weren't added by the installation script:
* soft memlock unlimited
* hard memlock unlimited
http://www.rdmamojo.com/2014/11/22/working-rdma-using-mellanox-ofed/
Native Infiniband / RoCE / Omni-Path Support (RDMA)
RDMA support for Infiniband, RoCE (RDMA over Converged Ethernet), and Omni-Path in BeeGFS is based on the Open Fabrics Enterprise Distribution ibverbs API (http://www.openfabrics.org).
https://www.beegfs.io/wiki/NativeInfinibandSupport
RDMA is Remote Direct Memory Access, which is a way of moving buffers between two applications across a network. RDMA differs from traditional network interfaces because it bypasses the operating system. This allows programs that implement RDMA to have:
The absolute lowest latency
The highest throughput
The smallest CPU footprint
To make use of RDMA we need to have a network interface card that implements an RDMA engine.
We call this an HCA (Host Channel Adapter). The adapter creates a channel from its RDMA engine through the PCI Express bus to the application memory.
In RDMA we setup data channels using a kernel driver. We call this the command channel.
We use the command channel to establish data channels which will allow us to move data while bypassing the kernel entirely. Once we have established these data channels, we can read and write buffers directly. The API used to establish these data channels is called "verbs".
The verbs API is maintained in an open source Linux project called the Open Fabrics Enterprise Distribution (OFED) (www.openfabrics.org). There is an equivalent project for Windows, WinOF.
Queue Pairs
RDMA operations start by “pinning” memory. When you pin memory you are telling the kernel that this memory is owned by the application. Now we tell the HCA to address the memory and prepare a channel from the card to the memory. We refer to this as registering a Memory Region. We can now use the memory that has been registered in any of the RDMA operations we want to perform
RDMA is an asynchronous transport mechanism.
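The whole sequence described above, registering a memory region, creating queue pairs over the command channel and then exchanging data asynchronously, can be exercised end to end with the ibv_rc_pingpong example that ships with libibverbs (a sketch; the device name and server host are placeholders):

# server side: open the RDMA device, register a buffer, create an RC queue pair and wait
ibv_rc_pingpong -d mlx5_0 -g 0

# client side: connect to the server's queue pair and ping-pong the registered buffers
ibv_rc_pingpong -d mlx5_0 -g 0 server1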
The rdma service reads /etc/rdma/rdma.conf to find out which kernel-level and user-level RDMA protocols the administrator wants to be loaded by default. You should edit this file to turn various drivers on or off. The rdma package provides the file /etc/udev/rules.d/70-persistent-ipoib.rules. This udev rules file is used to rename IPoIB devices from their default names (such as ib0 and ib1) to more descriptive names. You should edit this file to change how your devices are named.
RDMA communications require that physical memory in the computer be pinned (meaning that the kernel is not allowed to swap that memory out to a paging file in the event that the overall computer starts running short on available memory). Pinning memory is normally a very privileged operation. In order to allow users other than root to run large RDMA applications, it will likely be necessary to increase the amount of memory that non-root users are allowed to pin in the system. This is done by adding a file named rdma.conf in the /etc/security/limits.d/ directory.
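Its contents are presumably the same memlock settings shown earlier for /etc/security/limits.conf, i.e. something like:

# /etc/security/limits.d/rdma.conf: allow non-root users to pin (lock) memory for RDMA
* soft memlock unlimited
* hard memlock unlimited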
Because RDMA applications are so different from Berkeley Sockets-based applications and from normal IP networking, most applications that are used on an IP network cannot be used
directly on an RDMA network.
https://lenovopress.com/lp0823.pdf
Dramatically improves the performance of socket-based applications
Mellanox Messaging Accelerator (VMA) boosts the performance of message-based and streaming applications across a wide range of industries. These include the Financial Service market's High-Frequency Trading (HFT) platforms, Web2.0 clusters
VMA is an Open Source library project that exposes standard socket APIs with kernel-bypass architecture, enabling user-space networking for Multicast, UDP unicast and TCP streaming. VMA also features ease of use out-of-the-box, with built-in preconfigured profiles such as for latency or streaming.
http://www.mellanox.com/page/software_vma?mtag=vma
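VMA is typically pulled in by preloading the library under an unmodified socket application (a sketch; the application name is a placeholder and the profile variable is an assumption based on the preconfigured profiles mentioned above):

# run an existing UDP/TCP application over VMA's kernel-bypass socket implementation
LD_PRELOAD=libvma.so ./my_streaming_app

# optionally select a built-in profile, e.g. the latency-oriented one
VMA_SPEC=latency LD_PRELOAD=libvma.so ./my_streaming_app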
Using RDMA has the following major advantages:
Zero-copy - applications can perform data transfer without the network software stack being involved, and data is sent and received directly to the buffers without being copied between the network layers.
Kernel bypass - applications can perform data transfer directly from userspace without the need to perform context switches.
No CPU involvement - applications can access remote memory without consuming any CPU time in the remote machine. The remote machine's memory will be read without any intervention from a remote process (or processor). The caches of the remote CPU(s) won't be filled with the accessed memory content.
Message-based transactions - the data is handled as discrete messages and not as a stream, which eliminates the need of the application to separate the stream into different messages/transactions.
Scatter/gather entries support - RDMA supports natively working with multiple scatter/gather entries i.e. reading multiple memory buffers and sending them as one stream or getting one stream and writing it to multiple memory buffers
You can find RDMA in industries that need at least one of the following:
Low latency - For example HPC, financial services, web 2.0
High Bandwidth - For example HPC, medical appliances, storage, and backup systems, cloud computing
Small CPU footprint - For example HPC, cloud computing
There are several network protocols which support RDMA:
InfiniBand (IB) - a new generation network protocol which supports RDMA natively from the beginning. Since this is a new network technology, it requires NICs and switches which supports this technology.
RDMA Over Converged Ethernet (RoCE) - a network protocol which allows performing RDMA over Ethernet network. Its lower network headers are Ethernet headers and its upper network headers (including the data) are InfiniBand headers. This allows using RDMA over standard Ethernet infrastructure (switches). Only the NICs should be special and support RoCE.
Internet Wide Area RDMA Protocol (iWARP) - a network protocol which allows performing RDMA over TCP. There are features that exist in IB and RoCE that aren't supported in iWARP. This allows using RDMA over standard Ethernet infrastructure (switches). Only the NICs should be special and support iWARP (if CPU offloads are used); otherwise, the whole iWARP stack can be implemented in software, losing most of the RDMA performance advantages.
Facebook Introduces A Flexible NVMe JBOF
At the Open Compute Project Summit today Facebook is announcing its flexible NVMe JBOF (just a bunch of flash), Lightning.
Facebook is contributing this new JBOF to the Open Compute Project. Benefits include:
Lightning can support a variety of SSD form factors, including 2.5", M.2, and 3.5" SSDs.
Lightning will support surprise hot-add and surprise hot-removal of SSDs to make field replacements as simple and transparent as SAS JBODs.
Lightning will use an ASPEED AST2400 BMC chip running OpenBMC.
Lightning will support multiple switch configurations, which allows us to support different SSD configurations (for example, 15x x4 SSDs vs. 30x x2 SSDs) or different head node to SSD mappings without changing HW in any way.
Lightning will be capable of supporting up to four head nodes. By supporting multiple head nodes per tray, we can adjust the compute-to-storage ratio as needed simply by changing the switch configuration.
The HPC Project is focused on developing a fully open heterogeneous computing, networking and fabric platform, optimized for multi-node processing, that is agnostic to any computing group using x86, ARM, OpenPOWER, FPGA, ASIC, DSP, and GPU silicon on a common hardware platform.
This is the reason why Eurotech decided to concentrate on cooling and power conversion. Our systems are entirely liquid cooled with water up to 50 °C, making the use of air conditioning redundant. They also use a power conversion technology with an efficiency of 97%. This means an Eurotech based data center can have a PUE of 1.05!
https://www.eurotech.com/en/hpc/hpcomputing/energy+efficient+hpc
PUE for a dedicated building is the total facility energy divided by the IT equipment energy. PUE is an end-user metric used to help improve energy
efficiency in data center operations.
https://eehpcwg.llnl.gov/documents/infra/09_pue_past-tue_future.pdf
green HPC
The OpenFabrics Enterprise Distribution (OFED™)/OpenFabrics Software is open-source software for RDMA and kernel bypass applications. OFS is used in business, research and scientific environments that require highly efficient networks, storage connectivity, and parallel computing. The software provides high-performance computing sites and enterprise data centers with flexibility and investment protection as computing evolves towards applications that require extreme speeds, massive scalability, and utility-class reliability.
OFS includes kernel-level drivers, channel-oriented RDMA and send/receive operations, kernel bypasses of the operating system, both kernel and user-level application programming interface (API) and services for parallel message passing (MPI), sockets data exchange (e.g., RDS, SDP), NAS and SAN storage (e.g. iSER, NFS-RDMA, SRP) and file system/database systems.
The network and fabric technologies that provide RDMA performance with OFS include legacy 10 Gigabit Ethernet, iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and 10/20/40 Gigabit InfiniBand.
OFS is available for many Linux and Windows distributions, including Red Hat Enterprise Linux (RHEL), Novell SUSE Linux Enterprise Distribution (SLES), Oracle Enterprise Linux (OEL) and Microsoft Windows Server operating systems. Some of these distributions ship OFS in-box.
OFS for High-Performance Computing. OFS is used in high-performance computing for breakthrough applications that require high-efficiency computing, wire-speed messaging, microsecond latencies and fast I/O for storage and file systems
OFS for Enterprise Data Centers
OFS delivers valuable benefits to end-user organizations, including high CPU efficiency, reduced energy consumption, and reduced rack-space requirements.
https://www.openfabrics.org/index.php/openfabrics-software.html
EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High-Performance Computing (HPC) systems in an efficient way.
https://github.com/easybuilders/easybuild
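Basic usage is to point the eb command at an easyconfig file and let it resolve dependencies (a sketch; the easyconfig name is hypothetical and the modules command assumes Lmod or Environment Modules is installed):

# install EasyBuild itself (one option is pip), then build a package plus its dependencies
pip install easybuild
eb HPL-2.3-foss-2018b.eb --robot

# the built software is exposed as environment modules
module avail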
Using the Open Build Service (OBS) to manage the build process
Welcome to the OpenHPC site. OpenHPC is a collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage High-Performance Computing (HPC) Linux clusters including provisioning tools, resource management, I/O clients, development tools, and a variety of scientific libraries. Packages provided by OpenHPC have been pre-built with HPC integration in mind with a goal to provide re-usable building blocks for the HPC community
https://openhpc.community/
This stack provides a variety of common, pre-built ingredients required to deploy and manage an HPC Linux cluster including provisioning tools, resource management, I/O clients, runtimes, development tools, and a variety of scientific libraries.
https://github.com/openhpc/ohpc
OpenHPC containers
OpenHPC software stack
OpenHPC v1.3 software components
OpenHPC (v1.3.4)
Cluster Building Recipes
CentOS 7.4 Base OS, Warewulf/PBS Professional Edition for Linux* (x86_64)
This guide presents a simple cluster installation procedure using components from the OpenHPC software
stack. OpenHPC represents an aggregation of a number of common ingredients required to deploy and manage an HPC Linux* cluster including
provisioning tools,
resource management,
I/O clients,
development tools,
and a variety of scientific libraries
These packages have been pre-built with HPC integration in mind while conforming to common Linux distribution standards.
This guide is targeted at experienced Linux system administrators for HPC environments. Knowledge of
software package management, system networking, and PXE booting are assumed.
Unless specified otherwise, the examples presented are executed with elevated (root) privileges.
The examples also presume the use of the BASH login shell
This installation recipe assumes the availability of a single head node master, and four compute nodes.
4-node stateless cluster installation
The master node serves as the overall system management server (SMS) and is provisioned with CentOS7.4 and is
subsequently configured to provision the remaining compute nodes with Warewulf in a stateless configuration
The terms master and SMS are used interchangeably in this guide. For power management, we assume that
the compute node baseboard management controllers (BMCs) are available via IPMI from the chosen master
host.
For file systems, we assume that the chosen master server will host an NFS file system that is made
available to the compute nodes. Installation information is also provided for optionally mounting a parallel file system; in this case, the parallel file system is assumed to already exist.
The master host requires at least two Ethernet interfaces, with eth0 connected to the local data center network and eth1 used to provision and manage the cluster backend. Two logical IP interfaces are expected on each compute node: the first is the standard Ethernet interface that will be used for provisioning and resource management; the second is used to connect to each host's BMC and is used for power management and remote console access. Physical connectivity for these two logical IP networks is often accommodated via separate cabling and switching infrastructure; however, an alternate configuration can also be accommodated via the use of a shared NIC, which runs a packet filter to divert management packets between the host and BMC. In addition to the IP networking, there is an optional high-speed network (InfiniBand or Omni-Path in this recipe) that is also connected to each of the hosts. This high-speed network is used for application message passing and optionally for parallel file system connectivity as well (e.g. to existing Lustre or BeeGFS storage targets).
In an external setting, installing the desired BOS on a master SMS host typically involves booting from a DVD ISO image on a new server. With this approach, insert the CentOS7.4 DVD, power cycle the host, and follow the distro-provided directions to install the BOS on your chosen master host.
http://openhpc.community/downloads/
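The first steps of that recipe on the SMS node look roughly like the following (a sketch based on the 1.3.x guide; the exact ohpc-release RPM URL is left as a placeholder to be taken from the downloads page, and the package names assume the Warewulf/PBS Professional edition):

# on the master (SMS) host, running CentOS 7.4, as root:
# 1. enable the OpenHPC repository by installing the ohpc-release RPM
yum -y install <ohpc-release-1.3.x RPM URL from openhpc.community/downloads>

# 2. install base OpenHPC components, Warewulf provisioning and the PBS Professional server
yum -y install ohpc-base ohpc-warewulf pbspro-server-ohpc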
xCAT offers complete management for HPC clusters, RenderFarms, Grids, WebFarms, Online Gaming Infrastructure, Clouds, Datacenters etc. xCAT provides an extensible framework that is based on years of system administration best practices and experience.
https://xcat.org/
XCAT CLUSTER MANAGEMENT
The following diagram shows the basic structure of xCAT
PBS Professional software optimizes job scheduling and workload management in high-performance computing (HPC) environments
https://www.pbspro.org/
Warewulf is a scalable systems management suite originally developed to manage large high-performance Linux clusters.
http://warewulf.lbl.gov/
Hence the emergence of the tangible need for supercomputers, able to perform a large number of checks required in an extremely short time.
In addition, modern methods for the identification of cyber crimes increasingly involve techniques featuring cross-analysis of data coming from several different sources – and these techniques further increase the computing capacity required.
Today's security solutions require a mix of software and hardware to boost the power of security algorithms, analyze enormous data quantities in real time, rapidly encrypt and decrypt data, identify abnormal patterns, check identities, simulate attacks, validate software security proofs, patrol systems, analyze video material and many more additional actions.
The Aurora on-node FPGAs allow secure encryption/decryption to coexist with the fastest processors on the market. The Aurora Dura Mater rugged HPC, an outdoor portable highly robust system, can be deployed closer to where the data is captured to allow a number of in-field operations, for instance the processing of video surveillance data, bypassing network bottlenecks.
https://www.eurotech.com/en/hpc/industry+solutions/cyber+security
This is particularly true with respect to operational cybersecurity that, at best, is seen as a necessary evil, and considered as generally restrictive of performance and/or functionality. As a representative high-performance open-computing site, NERSC has decided to place as few roadblocks as possible for access to site computational and networking resources.
Tools such as SELinux can control access through predefined roles, but most HPC centers do not activate SELinux as it breaks the normal operation of the Linux clusters and it becomes harder to debug when applications don't work as expected.
The traffic among HPC systems connected through public or private networks is now carried exclusively over encrypted protocols that use OpenSSL, such as ssh, sftp, https, etc.
Since HPC resources are running some version of the Linux operating system, they all invariably run an iptables-based firewall at the host level, which is the primary tool to restrict access to service ports from the outside network.
https://idre.ucla.edu/sites/default/files/cybersecurity_hpc.pdf?x83242
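A host-level policy of that kind typically amounts to a few iptables rules (a sketch; the trusted subnet is an assumption):

# allow established sessions, loopback, and ssh from a trusted management subnet; drop other inbound traffic
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -s 10.1.0.0/16 -j ACCEPT
iptables -P INPUT DROP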
Some HPC centers do not rely on users to protect their passwords. So they implemented what is called One Time Password (OTP), where users are given small calculator-like devices to generate a random key to log in to the system. However, that is inconvenient for users, as they have to carry this device with them all the time, and it also adds to the operating cost of HPC systems.
Protecting a High-Performance Computing (HPC) cluster against real-world cyber threats is a critical task nowadays, with the increasing trend to open and share computing resources.
Security of cluster access usually relies on Discretionary Access Control (DAC) and sandboxing such as chroot, BSD Jail [1] or Vserver [2]. The DAC model has proven to be fragile [3]. Moreover, virtual containers do not protect against privilege escalation, where users can gain administration privileges in order to compromise data confidentiality and integrity.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5633784
GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. The GridFTP protocol is based on FTP, the highly-popular Internet file transfer protocol.
What is GridFTP?
A protocol for efficient file transfer
An extension of FTP
Currently the de facto standard for moving large files across the Internet
Using GridFTP: Some background
You move files between "endpoints" identified by a URI
One of these endpoints is often your local machine, but GridFTP can also be used to move data between two remote endpoints
X.509 ("grid") certificates are used for authentication to a particular end-point
Usually you'll have one certificate that you use at both end-points, but this need not be the case
Features of GridFTP
Security with GSI, the Grid Security Infrastructure
Third party transfers
Parallel and striped transfer
Partial file transfer
Fault tolerance and restart
Automatic TCP optimisation
https://www.eudat.eu/sites/default/files/GridFTP-Intro-Rome.pdf
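With the Globus Toolkit client installed, a transfer with parallel streams looks like this (a sketch; endpoints and paths are placeholders):

# copy a local file to a GridFTP endpoint using 4 parallel TCP streams, with verbose progress
globus-url-copy -vb -p 4 file:///data/results.tar gsiftp://gridftp.example.org/home/user/results.tar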
This paper describes FLEXBUS, a flexible, high-performance on-chip communication architecture featuring a dynamically configurable topology. FLEXBUS is designed to detect run-time variations in communication traffic characteristics, and efficiently adapt the topology of the communication architecture, both at the system level, through dynamic bridge bypass, as well as at the component level, using component re-mapping.
Intel® Omni-Path Fabric (Intel® OP Fabric) has a robust architecture and feature set to meet the current and future demands of high performance computing (HPC) at a cost-competitive price to today's fabric. Fabric builders members are working to enable world-class solutions based on the Intel® Omni-Path Architecture (Intel® OPA) with a goal of helping end users build the best environment to meet their HPC needs.
https://www.intel.com/content/www/us/en/products/network-io/high-performance-fabrics.html
High-Performance Computing Solutions
In-memory high performance computing
Build an HPE and Intel® OPA foundation
Solve supercomputing challenges
https://www.hpe.com/us/en/solutions/hpc-high-performance-computing.html
Memory-Driven Computing: the ultimate composable infrastructure
RoCE vs. iWARP Q&A
There are two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP.
Q. Does RDMA use the TCP/UDP/IP protocol stack?
A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don’t use Ethernet.
Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?
A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (like VxLAN offloads, checksum offloads etc.) along with RDMA functionality.
Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP?
A. RDMAP/DDP/MPA are stacked on top of TCP, so these protocols are sitting on top of Layer 4, the Transport Layer, based on the OSI model.
Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB etc.), on top of RoCE or iWARP?
A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this from happening a best practice would be to use explicit congestion notification (ECN), or better yet, data center bridging (DCB) to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, the network administrators can ensure bandwidth is available for the flows that require it the most. iWARP provides RDMA functionality over TCP/IP and inherits the loss resilience and congestion management from the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP including not requiring any specific host or switch configuration as well as out-of-the-box support across LAN/MAN/WAN networks.
Q. Is iWARP a lossless or lossy protocol?
A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offload engine (TOE) functionality.
http://sniaesfblog.org/roce-vs-iwarp-qa/
A Brief about RDMA, RoCE v1 vs RoCE v2 vs iWARP
For those who are wondering what these words are: this is a post about networking, and an overview of how to increase your network speed without adding new servers, by running IB over your network. IB stands for InfiniBand, a networking communication standard used mainly in HPC that features very high throughput and very low latency.
DCB
Data-Center Bridging (DCB) is an extension to the Ethernet protocol that makes dedicated traffic flows possible in a converged network scenario. DCB distinguishes traffic flows by tagging the traffic with a specific value (0-7) called a “CoS” value which stands for Class of Service. CoS values can also be referred to as “Priority” or “Tag”.
PFC
In standard Ethernet, we have the ability to pause traffic when the receive buffers are getting full. The downside of Ethernet pause is that it pauses all traffic on the link. As the name already gives away, Priority-based Flow Control (PFC) can pause traffic per flow based on that specific priority; in other words, PFC creates pause frames based on a traffic's CoS value.
ETS
With DCB in place, the traffic flows are nicely separated from each other and can pause independently because of PFC, but PFC does not provide any Quality-of-Service (QoS). If your servers are able to fully utilize the full network pipe with only storage traffic, other traffic such as cluster heartbeat or tenant traffic may be put in jeopardy.
The purpose of Enhanced Transmission Selection (ETS) is to allocate bandwidth based on the different priority settings of the traffic flows; this way the network components share the same physical pipe, but ETS makes sure that everyone gets the specified share of the pipe, preventing the "noisy neighbor" effect.
Data Center Bridging Exchange Protocol
This protocol is better known as DCBX, an extension of the DCB protocol where the "X" stands for eXchange. DCBX can be used to share information about the DCB settings between peers (switches and servers) to ensure you have a consistent configuration on your network.
http://innovativebit.blogspot.com/2018/06/a-brief-about-rdma-roce-v1-vs-roce-v2.html
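On Mellanox adapters, the usual way to wire this up for RoCE is to enable PFC for a single CoS value (a sketch; the interface name and the common choice of priority 3 are assumptions, and mlnx_qos is a vendor-specific tool from the Mellanox OFED/MLNX_EN packages):

# enable lossless behaviour (PFC) for priority 3 only; the other seven classes stay lossy
mlnx_qos -i eth1 --pfc 0,0,0,1,0,0,0,0

# show the resulting per-priority PFC and ETS configuration
mlnx_qos -i eth1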
File Storage Types and Protocols for Beginners
DAS SCSI
FC SAN
iSCSI
NAS
SMB Multichannel and SMB Direct
SMB Multichannel is the feature responsible for detecting the RDMA capabilities of network adapters to enable SMB Direct. Without SMB Multichannel, SMB uses regular TCP/IP with the RDMA-capable network adapters (all network adapters provide a TCP/IP stack along with the new RDMA stack).
SMB Direct is SMB 3 traffic over RDMA. So, what is RDMA? DMA stands for Direct Memory Access. This means that an application has access (read/write) to a host's memory directly, without CPU intervention. If you do this between hosts, it becomes Remote Direct Memory Access (RDMA).
DDN's GRIDScaler gets bigger scale-up and scale-out muscles
MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard
https://www.mpich.org/
Catalyst UK Program successfully drives development of Arm-based HPC systems
Best Practice Guide – Parallel I/O
Can (HPC) Clouds supersede traditional High Performance Computing?
The MVAPICH2 software, based on MPI 3.1 standard, delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE networking technologies.
http://mvapich.cse.ohio-state.edu/
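Whichever MPI implementation is installed (Open MPI, MPICH or MVAPICH2), the basic workflow is the same wrapper-compile-and-launch pattern; a minimal sketch (rank counts and any hostfile/allocation details depend on the cluster and its resource manager):

# hello.c: each rank reports its rank and the total number of ranks
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# compile with the MPI wrapper and launch 4 ranks
mpicc hello.c -o hello
mpirun -np 4 ./hello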
Building HPC Cloud with InfiniBand: Efficient Support in MVAPICH2, Xiaoyi Lu
MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
Keynote: Designing HPC, Big Data & Deep Learning Middleware for Exascale - DK Panda, The Ohio State University
OpenStack and HPC Network Fabrics
HPC Networking in OpenStack: Part 1
OpenStack and HPC Workload Management
A Self-Adaptive Network for HPC Clouds: Architecture, Framework, and Implementation
OpenStack and Virtualised HPC
Making Containers Easier with HPC Container Maker
SUSE Linux Enterprise High Performance Computing
OSC's Owens cluster being installed in 2016 is a Dell-built, Intel® Xeon® processor-based supercomputer.
MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand
Introduction to Linux & HPC
One Library with Multiple Fabric Support
Intel® MPI Library is a multifabric message-passing library that implements the open-source MPICH specification. Use the library to create, maintain, and test advanced, complex applications that perform better on HPC clusters based on Intel® processors.
https://software.intel.com/en-us/mpi-library
Supermicro High Performance Computing (HPC) Solutions