Designing Databases for Future High-Performance Networks

Distributed query and transaction processing has been an active field of research ever since the volume of the
data to be processed outgrew the storage and processing capacity of a single machine. Two platforms of choice
for data processing are database appliances, i.e., rack-scale clusters composed of several machines connected
through a low-latency network, and scale-out infrastructure platforms for batch processing, e.g., data analysis
applications such as Map-Reduce.
A fundamental design rule on how software for these systems should be implemented is the assumption that
the network is relatively slow compared to local in-memory processing. Therefore, the execution time of a query
or a transaction is assumed to be dominated by network transfer times and the costs of synchronization. However,
data processing systems are starting to be equipped with high-bandwidth, low-latency interconnects that can
transmit vast amounts of data between the compute nodes and provide single-digit microsecond latencies. In
light of this new generation of network technologies, such a rule is being re-evaluated, leading to new types of
database algorithms [6, 7, 29] and fundamental system design changes [8, 23, 30].
Modern high-throughput, low-latency networks originate from high-performance computing (HPC) systems.
Similar to database systems, the performance of scientific applications depends on the ability of the system to
move large amounts of data between compute nodes. Several key features offered by these networks are (i) user-level networking, (ii) an asynchronous network interface that allows the algorithm to interleave computation and communication, and (iii) the ability of the network card to directly access regions of main memory without going through the processor, i.e., remote direct memory access (RDMA). To leverage the advantages of these networks, the design of data processing algorithms and systems needs to be reconsidered.

2 Background and Definitions
In this section, we explain how the concepts of Remote Direct Memory Access (RDMA), Remote Memory
Access (RMA), and Partitioned Global Address Space (PGAS) relate to each other. Furthermore, we include an
overview of several low-latency, high-bandwidth network technologies implementing these mechanisms.
2.1 Remote Direct Memory Access
Remote Direct Memory Access (RDMA) is a hardware mechanism through which the network card can directly
access all or parts of the main memory of a remote node without involving the processor. Bypassing the CPU and
the operating system makes it possible to interleave computation and communication and avoids copying data across different buffers within the network stack and user space, which significantly lowers the cost of large data transfers and reduces the end-to-end communication latency.
In many implementations, buffers need to be registered with the network card before they are accessible
over the interconnect. During the registration process, the memory is pinned such that it cannot be swapped out,
and the necessary address translation information is installed on the card, operations that can have a significant
overhead [14]. Although this registration process is needed for many high-speed networks, it is worth noting
that some network implementations also support registration-free memory access [10, 27].
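As an illustration of the registration step, the following minimal C sketch registers a buffer using the ibVerbs interface discussed in Section 2.2. It is a sketch only, assuming an RDMA-capable device is present; error handling is abbreviated.

    /* Minimal sketch: registering a buffer with the network card via ibVerbs. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA device found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

        size_t len = 1 << 20;                           /* 1 MiB buffer */
        void *buf = malloc(len);

        /* Registration pins the pages and installs the address translation
         * on the card; the returned keys are used in later work requests. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { perror("ibv_reg_mr"); return 1; }
        printf("registered %zu bytes, rkey=0x%x\n", len, mr->rkey);

        ibv_dereg_mr(mr);                               /* unpins the memory */
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }

The registration overhead noted above is one reason systems typically register large regions once at startup and reuse them, rather than registering buffers on the critical path.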
RDMA as a hardware mechanism does not specify the semantics of a data transfer. Most modern networks provide support for one-sided and two-sided memory accesses. Two-sided operations represent traditional message-passing semantics in which the source process (i.e., the sender of a message) and the destination
process (i.e., the receiver of a message) are actively involved in the communication and need to be synchronized;
i.e., for every send operation there must exist exactly one corresponding receive operation. One-sided operations, on the other hand, represent memory access semantics in which only the source process (i.e., the initiator of a
request) is involved in the remote memory access. In order to efficiently use remote one-sided memory operations, multiple programming models have been developed, the most popular of which are the Remote Memory Access (RMA) and the Partitioned Global Address Space (PGAS) concepts.
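The contrast between the two semantics can be made concrete with MPI, one of the interfaces discussed below. The following is a hedged sketch, assuming two ranks: rank 0 first transfers a value through a matched send/receive pair (two-sided), then writes it into rank 1's memory with a put while rank 1 stays passive (one-sided).

    /* Sketch: two-sided vs. one-sided communication in MPI (run with 2 ranks). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Two-sided: every send is matched by exactly one receive. */
        int val = 42;
        if (rank == 0)
            MPI_Send(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* One-sided: only the origin (rank 0) issues the access. */
        int win_buf = 0;
        MPI_Win win;
        MPI_Win_create(&win_buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);                /* open the access epoch */
        if (rank == 0)
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);                /* complete all pending puts */
        if (rank == 1)
            printf("rank 1 window now holds %d\n", win_buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }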
2.2 Remote Memory Access
Remote Memory Access (RMA) is a shared memory programming abstraction. RMA provides access to remote
memory regions through explicit one-sided read and write operations. These operations move data from one
buffer to another, i.e., a read operation fetches data from a remote machine and transfers it to a local buffer,
while a write operation transmits data in the opposite direction. Data located on a remote machine therefore cannot be loaded directly into a register, but first needs to be read into a local main memory buffer.
Using the RMA memory abstractions is similar to programming non-cache-coherent machines in which data
has to be explicitly loaded into the cache-coherency domain before it can be used and changes to the data have
to be explicitly flushed back to the source in order for the modifications to be visible on the remote machine.
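To illustrate that remote data must first land in local main memory, the following MPI-3 sketch (again assuming two ranks) reads a remote integer into a local buffer with a one-sided get; the target rank does not participate in the transfer.

    /* Sketch: a one-sided read with MPI-3 passive-target synchronization. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int exposed = (rank + 1) * 100;       /* data each rank exposes */
        MPI_Win win;
        MPI_Win_create(&exposed, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (rank == 0) {
            int local = 0;                    /* remote data lands here first */
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
            MPI_Get(&local, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_unlock(1, win);           /* completes the transfer */
            printf("rank 0 read %d from rank 1\n", local);  /* now usable */
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }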
The processes on the target machine are generally not notified about an RMA access, although many interfaces offer read and write calls with remote process notifications. Apart from read and write operations, some
RMA implementations provide support for additional functionality, most notably remote atomic operations.
Examples of such atomic operations are remote fetch-and-add and compare-and-swap instructions.
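A minimal sketch of such a remote atomic, again using the MPI-3 one-sided interface (the counter placement and window layout are illustrative): every rank increments a counter that lives on rank 0 with a remote fetch-and-add.

    /* Sketch: remote fetch-and-add on a counter hosted by rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int counter = 0;                      /* lives in rank 0's window */
        MPI_Win win;
        MPI_Win_create(&counter, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        int one = 1, old = 0;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        /* Atomically adds 'one' to the remote counter; 'old' receives the
         * previous value, as with a local fetch-and-add instruction. */
        MPI_Fetch_and_op(&one, &old, MPI_INT, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("final counter: %d (expected %d)\n", counter, size);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }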
RMA has been designed to be a thin and portable layer compatible with many lower-level data movement
interfaces. RMA has been adopted by many libraries such as ibVerbs [17] and MPI-3 [25] as their one-sided
communication and remote memory access abstraction.
RDMA-capable networks implement the functionality necessary for efficient low-latency, high-bandwidth
one-sided memory accesses. It is worth pointing out that RMA programming abstractions can also be used over
networks which do not support RDMA, for example by implementing the required operations in software [26].

2.3 Partitioned Global Address Space
Partitioned Global Address Space (PGAS) is a programming language concept for writing parallel applications
for large distributed memory machines. PGAS assumes a single global memory address space that is partitioned
among all the processes. The programming model distinguishes between local and remote memory. This can
be specified by the developer through the use of special keywords or annotations [9]. PGAS is therefore usually found in the form of a programming language extension and is one of the main concepts behind several
languages, such as Co-Array Fortran or Unified Parallel C.
Local variables can only be accessed by the local processes, while shared variables can be written or read
over the network. In most PGAS languages, both types of variables can be accessed in the same way. It is the
responsibility of the compiler to add the necessary code to implement a remote variable access. This means that
from a programming perspective, a remote variable can be assigned directly to a local variable or a register
and does not need to be explicitly loaded into main memory first as is the case with RMA.
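A short sketch in Unified Parallel C, one of the PGAS languages named above (array and variable names are illustrative), shows this: reading a remote element of a shared array is written as an ordinary assignment, and the compiler generates the communication.

    /* Sketch: PGAS-style access in Unified Parallel C (UPC). */
    #include <upc.h>
    #include <stdio.h>

    shared int counts[THREADS];   /* partitioned: one element per thread */

    int main(void)
    {
        int local = MYTHREAD;     /* private variable, only visible locally */
        counts[MYTHREAD] = local; /* write to the local partition */
        upc_barrier;

        /* Syntactically an ordinary assignment, but this may move data
         * over the network when the element lives on another thread. */
        int neighbour = counts[(MYTHREAD + 1) % THREADS];
        printf("thread %d sees neighbour value %d\n", MYTHREAD, neighbour);
        return 0;
    }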
When programming with a PGAS language, the developer needs to be aware of implicit data movement when
accessing shared variable data, and careful non-uniform memory access (NUMA) optimizations are required for
applications to achieve high performance.
2.4 Low-latency, High-bandwidth Networks
Many high-performance networks offer RDMA functionality. Examples of such networks are InfiniBand [19]
and Cray Aries [2]. Both networks offer a bandwidth of 100 Gb/s or more and a latency in the single-digit
microsecond range. However, RDMA is not exclusively available on networks originally designed for supercomputers: RDMA over Converged Ethernet (RoCE) [20] hardware adds RDMA capabilities to a conventional
Ethernet network.
