
HCIA-Cloud Computing V5.0 Learning Guide

3 Storage Technology Basics

Data is the most important asset for every user. This chapter describes how and where data is stored and introduces the key data storage technologies used in cloud computing.

3.1 Storage Basics


3.1.1 What Is Storage

Figure 3-1 Data storage procedure


Figure 3-1 shows the generation, processing, and management of data. Storage is used
to manage data.
Storage can be defined in a narrow sense and a broad sense.
In the narrow sense, storage refers to specific storage devices, such as CDs, DVDs, Zip drives, tapes, and disks.
In the broad sense, storage consists of the following four parts:
⚫ Storage hardware (disk arrays, controllers, disk enclosures, and tape libraries)
⚫ Storage software (backup software, management software, and value-added
software such as snapshot and replication)
⚫ Storage networks (HBAs, Fibre Channel switches, as well as Fibre Channel and SAS
cables)
⚫ Storage solutions (centralized storage, archiving, backup, and disaster recovery)

3.1.2 History of Storage

Figure 3-2 History of storage


As shown in Figure 3-2, the storage architecture has gone through the following
development phases: traditional storage, external storage, storage network, distributed
storage, and cloud storage.
⚫ Traditional storage: refers to individual disks. In 1956, IBM invented the world's first mechanical hard drive, which had fifty 24-inch platters and a total storage capacity of just 5 MB. It was about the size of two refrigerators and weighed more than a ton. At that time it was used in the industrial field and was independent of the mainframe.
⚫ External storage refers to direct-attached storage. The earliest form of external
storage is JBOD, which stands for Just a Bunch Of Disks. JBOD is identified by the
host as a stack of independent disks. It provides large capacity but low security.
⚫ Storage network: A storage area network (SAN) is a typical storage network that
transmits data mainly over a Fibre Channel network. Then, IP SANs emerge.
⚫ Distributed storage and cloud storage: Distributed storage uses general-purpose
servers to build storage pools and is more suitable for cloud computing.

3.1.2.1 Storage Development: from Server Attached Storage to Independent Storage Systems

Figure 3-3 Development from server attached storage to independent storage systems
In the early phase of enterprise storage, disks were built into servers. As storage technologies developed, the limitations of this architecture gradually emerged.
⚫ Disks in the server are prone to becoming a system performance bottleneck.
⚫ The number of disk slots is limited, thereby limiting capacity.
⚫ Data is stored on individual disks, resulting in poor reliability.
⚫ Storage space utilization is low.
⚫ Data is scattered in local storage systems.
To meet new storage requirements, external disk arrays are introduced. Just a Bunch Of Disks (JBOD) combines multiple disks to provide storage resources externally. It is merely a collection of disks without control software to coordinate and manage resources, and it does not support Redundant Array of Independent Disks (RAID). This architecture resolves the problem of the limited number of disk slots in a server, thereby increasing system capacity.
With the emergence of RAID technology, disk arrays that use RAID become smarter. RAID resolves the problems of limited disk interface performance and the poor reliability of individual-disk storage.

3.1.2.2 Storage Development: from Independent Storage Systems to Network Shared Storage

Figure 3-4 Development from independent storage systems to network shared storage
As mentioned in the previous section, the direct connection between the storage and
server through the controller resolves the problems caused by the limited disk slot
quantities, individual-disk storage, and limited disk interface performance.

However, other problems remain, such as low storage space utilization, decentralized
data management, and inconvenient data sharing. We will learn how the network shared
storage such as SAN and NAS solves these pain points.

3.1.3 Mainstream Disk Types


The concept of disks has been described in 2 Server Basics, and details are not described
herein again.

Figure 3-5 Mainstream disk types


To understand disks, we need to know some disk metrics, such as disk capacity, rotational speed, average access time, data transfer rate, and input/output operations per second (IOPS). Rotational speed is specific to HDDs.
⚫ Disk capacity is measured in MB or GB. The factors that affect the disk capacity
include the single platter capacity and the number of platters.
⚫ Rotational speed is the number of rotations made by disk platters per minute. The unit is rotations per minute (rpm). In most cases, the rotational speed of a disk reaches 5400 rpm or 7200 rpm. Disks that use the SCSI interface reach 10,000 rpm to 15,000 rpm.
⚫ Average access time is the average seek time plus the average wait time.
⚫ Data transfer rate of a disk is the speed at which data is read from or written to the
disk. It is measured in MB/s. The data transfer rate consists of the internal data
transfer rate and the external data transfer rate.
⚫ IOPS indicates the number of input/output operations or read/write operations per
second. It is a key metric to measure disk performance. For applications with
frequent random read/write operations, such as online transaction processing (OLTP),
IOPS is a key metric. Another key metric is the data throughput, which indicates the
amount of data that can be successfully transferred per unit time. For applications
that require a large number of sequential read/write operations, such as video
editing and video on demand (VoD) at TV stations, the throughput is more of a
focus.
When measuring the performance of a disk or storage system, we usually consider the
following metrics: average access time, data transfer rate, and IOPS. To be specific,
shorter average access time, higher data transfer rate, and higher IOPS indicate better
disk performance.
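The relationships among these metrics can be illustrated with a quick calculation. The following Python sketch estimates the average access time and theoretical random IOPS of a hypothetical 7200 rpm HDD; the seek time and I/O size are assumed values used only for illustration, not vendor data.

# Rough single-disk performance estimate for a hypothetical 7200 rpm HDD.
# The seek time and I/O size below are illustrative assumptions.
rotational_speed_rpm = 7200
average_seek_time_ms = 8.5            # assumed average seek time
io_size_kib = 8                       # assumed I/O size per request

# Average rotational latency (wait time) is half a revolution.
average_wait_time_ms = (60 * 1000 / rotational_speed_rpm) / 2

# Average access time = average seek time + average wait time.
average_access_time_ms = average_seek_time_ms + average_wait_time_ms

# Theoretical random IOPS: how many accesses fit into one second.
iops = 1000 / average_access_time_ms

# Throughput achievable at that IOPS with the assumed I/O size.
throughput_mib_s = iops * io_size_kib / 1024

print(f"Average access time: {average_access_time_ms:.2f} ms")
print(f"Estimated IOPS:      {iops:.0f}")
print(f"Throughput at 8 KiB: {throughput_mib_s:.2f} MiB/s")

This also shows why a shorter access time translates directly into higher IOPS, and why sequential workloads are better characterized by throughput than by IOPS.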

3.1.4 Storage Networking Types


3.1.4.1 Introduction to DAS
As shown in Figure 3-6, direct attached storage (DAS) is a type of storage that is
attached directly to a computer through the SCSI or Fibre Channel interface. DAS does not go through a network, so only the host to which the storage device is attached can access it. That is, if a server is faulty, the data in the DAS device connected to that
server is unavailable. Common interfaces include small computer systems interface
(SCSI), serial attached SCSI (SAS), external SATA (eSATA), serial ATA (SATA), Fibre
Channel, USB 3.0 and Thunderbolt.

Figure 3-6 Architecture of DAS


The DAS device communicates with the server or host through SAS channels (3 Gbit/s, 6
Gbit/s, and 12 Gbit/s). However, as CPU processing capability strengthens, storage disk space expands, and the number of disks in a disk array increases, the SAS channel becomes the I/O bottleneck. The limited SAS interface resources of a server host limit the channel bandwidth.

3.1.4.2 Introduction to NAS

Figure 3-7 Architecture of NAS


As shown in Figure 3-7, network attached storage (NAS) is a type of storage that
connects to a group of computers through a standard network (for example, an Ethernet
network). A NAS device has a file system and an assigned IP address, and may be
regarded as a shared disk in Network Neighborhood.
Developing networks drove the need for large-scale data sharing and exchange, leading
to dedicated NAS storage devices.
Access mode: Multiple front-end servers share space on back-end NAS storage devices
using CIFS or NFS. Concurrent read and write operations can be performed on the same
directory or file.
In the NAS system, Linux clients mainly use Network File System (NFS) protocol, and
Windows clients mainly use Common Internet File System (CIFS) protocol. The NAS file
system is on the back-end storage device.
NFS is an Internet standard protocol created by Sun Microsystems in 1984 for file sharing
between systems on a local area network (LAN).
It uses the Remote Procedure Call (RPC) protocol.
⚫ RPC provides a set of operations to achieve remote file access that are not restricted
by machines, operating systems (OSs), and lower-layer transmission protocols. It
allows remote clients to access storage over a network like accessing a local file
system.
⚫ The NFS client sends an RPC request to the NFS server. The server transfers the
request to the local file access process, reads the local disk files on the server, and
returns the files to the client.
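Because RPC makes the remote storage behave like a local file system, an application needs no NFS-specific code. The minimal Python sketch below assumes an NFS share has already been mounted at /mnt/nfs_share (a hypothetical mount point); reading and writing it is identical to working with local files.

import os

# Hypothetical mount point where an NFS export has been mounted by the OS,
# e.g. with: mount -t nfs nas-server:/export/projects /mnt/nfs_share
NFS_MOUNT = "/mnt/nfs_share"

report_path = os.path.join(NFS_MOUNT, "reports", "summary.txt")

# Writing to the NFS share looks exactly like writing to a local disk;
# the NFS client turns these calls into RPC requests to the NFS server.
os.makedirs(os.path.dirname(report_path), exist_ok=True)
with open(report_path, "w", encoding="utf-8") as f:
    f.write("Quarterly capacity report\n")

# Reading the file back goes through the same RPC path transparently.
with open(report_path, encoding="utf-8") as f:
    print(f.read())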

CIFS is a network file system protocol used for sharing files and printers between
machines on a network. It is mainly used to share network files between hosts running
Windows.
NAS is a file-level storage architecture that meets the requirements of work teams and
departments on quick storage capacity expansion. Currently, NAS is widely used to share
documents, images, and movies. NAS supports multiple protocols (such as NFS and CIFS)
and supports various OSs. Users can conveniently manage NAS devices by using a web browser, such as Internet Explorer or Netscape, on any workstation.

3.1.4.3 Introduction to SAN

Figure 3-8 Architecture of SAN


The storage area network (SAN) is a dedicated storage network that connects one or
more network storage devices to servers. It is a high-performance and dedicated storage
network used between servers and storage resources. In addition, it is a back-end storage
network independent from a LAN. The SAN adopts a scalable network topology for
connecting servers and storage devices. The storage devices do not belong to any of the
servers but can be shared by all the servers on the network.
SAN features fast data transmission, high flexibility, and reduced network complexity. It
eliminates performance bottlenecks of the traditional architecture and massively
improves the backup and disaster recovery efficiency of remote systems.
A SAN is a network architecture that consists of storage devices and system components,
including servers that need to use storage resources, host bus adapters (HBAs) that
connect storage devices, and Fibre Channel switches.
On a SAN, all communication related to data storage is implemented on a network
independent of the application network. Therefore, SAN improves I/O capabilities of the
entire network without affecting the existing application network, offers a backup
connection for the storage system, and supports high availability (HA) cluster systems.

With the development of SAN technologies, three SAN types are made available: FC SAN,
IP SAN, and SAS SAN. The following describes FC SAN and IP SAN.
3.1.4.3.1 Introduction to FC SAN

Figure 3-9 Architecture of FC SAN


As shown in Figure 3-9, on an FC SAN, each storage server is configured with two
network interface adapters. One is a common network interface card (NIC) that connects
to the service IP network. The server interacts with the client through this NIC. The other
is an HBA that connects to the FC SAN. The server communicates with the storage
devices on the FC SAN through this adapter.

3.1.4.3.2 Introduction to IP SAN

Figure 3-10 Architecture of IP SAN


IP SAN has become a popular network storage technology in recent years. Early SANs were all FC SANs, where data is transferred over Fibre Channel in blocks. Due to the incompatibility between the FC protocol and the IP protocol, customers who want to use FC SAN have to purchase dedicated devices and components. As a result, many small and medium-sized users are deterred by the high cost and complicated configuration. Therefore, FC SAN is mainly used for mid-range and high-end storage that requires high performance, redundancy, and availability. To popularize SANs and leverage the advantages of the SAN architecture, technicians considered combining SANs with prevailing, affordable IP networks. The result is IP SAN, which uses the existing IP network architecture. IP SAN combines the standard TCP/IP protocol with the SCSI instruction set and implements block-level data storage over an IP network.
The difference between IP SAN and FC SAN lies in the transfer protocol and medium.
Common IP SAN protocols include Internet SCSI (iSCSI), Fibre Channel over IP (FCIP), and
Internet Fibre Channel Protocol (iFCP). iSCSI is the fastest growing protocol standard. In
most cases, IP SAN refers to iSCSI-based SAN.
The iSCSI-based SAN uses an iSCSI initiator (server) and an iSCSI target (storage device)
on the IP network to form a SAN.

3.1.4.3.3 Comparison Among Storage Networking Types

Figure 3-11 Comparison among storage networking types


Figure 3-11 describes the three storage networking types. SAN and NAS complement
each other to provide access to different types of data.

3.1.5 Storage Types


3.1.5.1 Centralized Storage

Figure 3-12 Architecture of centralized storage


By "centralized", it is meant that all resources are centrally deployed and are used to
provide services over a unified interface. Centralized storage means that all physical disks
are centrally deployed in the disk enclosure and are used to provide storage services
externally through the controller. Centralized storage typically refers to disk arrays.
In terms of technical architectures, centralized storage is classified into SAN and NAS.
SANs can be classified into Fibre Channel SAN (FC SAN), Internet Protocol SAN (IP SAN),
and Fibre Channel over Ethernet SAN (FCoE SAN). Currently, FC SAN and IP SAN
technologies are mature, and FCoE SAN is still in the early stage of its development.
A disk array combines multiple physical disks into a single logical unit. Each disk array
consists of one controller enclosure and multiple disk enclosures. This architecture
delivers an intelligent storage space featuring high availability, high performance, and
large capacity.

3.1.5.2 Distributed Storage

Figure 3-13 Architecture of distributed storage


Unlike centralized storage, distributed storage does not store data on one or more specific nodes. It virtualizes all available space distributed across the hosts of an enterprise into a virtual storage device. In this way, the data stored in this virtual storage device is also distributed over the storage network.
As shown in Figure 3-13, distributed storage uses general-purpose servers rather than dedicated storage devices. A distributed storage system does not have any controller enclosure or disk enclosure. All disk storage resources are delivered by general-purpose x86 servers. The distributed storage system delivers client software that identifies and manages disks, establishes data routing, and processes read/write I/Os.
The distributed storage client mode has advantages and disadvantages. In terms of
capacity expansion, an x86 server with a client installed can be a part of the distributed
storage system. This mode delivers great scalability. However, in addition to the
applications running on the server, the client software installed on the server also
consumes compute resources. When you plan a distributed storage system, you must
reserve certain amounts of compute resources on servers you intend to add to this
system. Therefore, this mode has certain requirements on the hardware resources of the
server. In a traditional centralized storage system, data is read and written by controllers.
However, the number of controllers is limited. In a distributed storage system, servers
with clients can read and write data, breaking the limit on the number of controllers and
improving the read and write speed to some extent. However, the read/write paths need to be calculated each time data is read or written. If there are too many clients, path calculation becomes complicated. Once the optimum performance is reached, adding more clients cannot further improve performance.

For high data availability and security, centralized storage systems use RAID technology. RAID can be implemented by hardware or software, and all disks must be deployed on the same server (hardware RAID requires a unified RAID card, and software RAID requires a unified OS). In a distributed storage system, however, disks are distributed across different servers, so the RAID mechanism cannot be used.
Therefore, a distributed storage system introduces a replication mechanism to ensure high data reliability. The replication mechanism copies data and stores the copies on different servers. If a server is faulty, data will not be lost.
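A minimal sketch of this idea follows, assuming three copies of each data block are kept on different servers (the server names and the three-copy policy are illustrative only): if one server fails, the block can still be read from a surviving copy.

# Toy model of the replication mechanism in distributed storage.
# Server names and the three-copy policy are illustrative assumptions.
REPLICA_COUNT = 3
servers = {"server-1": {}, "server-2": {}, "server-3": {}, "server-4": {}}

def write_block(block_id: str, data: bytes) -> list[str]:
    """Write one block to REPLICA_COUNT different servers (simple placement)."""
    names = sorted(servers)
    placed = [names[(hash(block_id) + i) % len(names)] for i in range(REPLICA_COUNT)]
    for name in placed:
        servers[name][block_id] = data
    return placed

def read_block(block_id: str, failed: set[str]) -> bytes:
    """Read the block from any surviving replica."""
    for name, disk in servers.items():
        if name not in failed and block_id in disk:
            return disk[block_id]
    raise IOError("all replicas lost")

locations = write_block("blk-001", b"user data")
print("replicas on:", locations)
# Simulate the failure of one server holding a replica: data is still readable.
print(read_block("blk-001", failed={locations[0]}))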

3.1.5.3 Storage Service Types


3.1.5.3.1 Block Storage
Block storage commonly uses an architecture that connects storage devices and
application servers over a network. This network is used only for data access between
servers and storage devices. When there is an access request, data can be transmitted
quickly between servers and backend storage devices as needed. From a client's
perspective, block storage functions the same way as disks. One can format a disk with
any file system and then mount it. A major difference between block storage and file
storage is that block storage provides storage spaces only, leaving the rest of the work,
such as file system formatting and management, to the client.
Block storage uses evenly sized blocks to store structured data. In block storage, data is
stored without any metadata. This makes block storage useful when applications need to
strictly control the data structure. The most common use is for databases. Databases can read and write structured data faster with raw block devices.
Currently, block storage is usually deployed in FC SAN and IP SAN based on the protocols
and connectors used. FC SAN uses the Fibre Channel protocol to transmit data between
servers (hosts) and storage devices, whereas, IP SAN uses the IP protocol for
communication. The FC technology can meet the growing needs for high-speed data
transfer between servers and large-capacity storage systems. With the FC protocol, data
can be transferred faster with low protocol overheads, while maintaining certain network
scalability.
Block storage has the following advantages:
⚫ Offers long-distance data transfer with a high bandwidth and a low transmission bit
error rate.
⚫ Based on the SAN architecture and massive addressable devices, multiple servers can
access a storage system over the storage network at the same time, eliminating the
need for purchasing storage devices for every server. This reduces the heterogeneity
of storage devices and improves storage resource utilization.
⚫ Protocol-based data transmission can be handled by the HBA, occupying less CPU
resources.
In a traditional block storage environment, data is transmitted over the fibre channel via
block I/Os. To leverage the advantages of FC SAN, enterprises need to purchase
additional FC components, such as HBAs and switches. Enterprises usually have an IP
network-based architecture. As technologies evolve, block I/Os now can be transmitted
over the IP network, which is called IP SAN. With IP SAN, legacy infrastructure can be
reused, which is far more economical than investing in a brand new SAN environment. In
addition, many remote and disaster recovery solutions are also developed based on the IP
network, allowing users to expand the physical scope of their storage infrastructure.
Internet SCSI (iSCSI), Fibre Channel over IP (FCIP), and Fibre Channel over Ethernet
(FCoE) are the major IP SAN protocols.
⚫ iSCSI encapsulates SCSI I/Os into IP packets and transmits them over TCP/IP. iSCSI is
widely used to connect servers and storage devices because it is cost-effective and
easy to implement, especially in environments without FC SAN.
⚫ FCIP allows FCIP entities, such as FCIP gateways, to implement FC switching over IP
networks. FCIP combines the advantages of FC SAN and the mature, widely-used IP
infrastructure. This gives enterprises a better way to use existing investments and
technologies for data protection, storage, and migration.
⚫ FCoE achieves I/O consolidation. Usually, one server in a data center is equipped with
two to four NICs and HBAs for redundancy. If there are hundreds of servers in a data
center, numerous adapters, cables, and switches required make the environment
complex and difficult to manage and expand. FCoE achieves I/O consolidation via
FCoE switches and Converged Network Adapters (CNA). CNAs replace the NICs and
HBAs on the servers and consolidate IP traffic and FC traffic. In this way, servers no
longer need various network adapters and many independent networks, thus the
requirement of NICs, cables, and switches is reduced. This massively lowers the costs
and management overheads.
Block storage is a high-performance network storage, but data cannot be shared
between hosts in block storage. Some enterprise workloads may require data or file
sharing between different types of clients, and block storage cannot do this.
3.1.5.3.2 File Storage
File storage provides file-based, client-side access over the TCP/IP protocol. In file storage,
data is transferred via file I/Os in the local area network (LAN). A file I/O is a high-level
request for accessing a specific file. For example, a client can access a file by specifying
the file name, location, or other attributes. The NAS system records the locations of files
on disks and converts the client's file I/Os to block I/Os to obtain data.
File storage is a commonly used type of storage for desktop users. When you open and close a document on your computer, you are using the file system. Clients can access file systems on the file storage for file upload and download. Protocols used for file sharing
between clients and storage include CIFS (SMB) and NFS. In addition to file sharing, file
storage also provides file management functions, such as reliability maintenance and file
access control. Although there are differences in managing file storage and local files, file
storage is basically a directory to users. One can use file storage almost the same as
using local files.
Because NAS access requires file system format conversion, it is not suitable for block-based applications, especially database applications that require raw devices.
File Storage has the following advantages:
⚫ Comprehensive information access: Local directories and files can be accessed by users on other computers over the LAN. Multiple end users can collaborate with each other based on the same files, such as project documents and source code.
⚫ Good flexibility: NAS is compatible with both Linux and Windows clients.
⚫ Low cost: NAS uses common and low-cost Ethernet components.


3.1.5.3.3 Object Storage
Users who frequently access the Internet and use mobile devices often need object
storage techniques. The core of object storage is to separate the data path from the
control path. Object storage does not provide access to original blocks or files, but to the
entire object data via system-specific APIs. You can access objects using HTTP/REST-
based uniform resource locators (URLs), like you access websites using browsers. Object
storage abstracts storage locations as URLs so that storage capacity can be expanded in a
way that is independent of the underlying storage mechanism. This makes object storage
an ideal way to build a large-scale system with high concurrency.
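The URL-based access model can be illustrated with plain HTTP. The sketch below uses Python's standard library to PUT and then GET an object against a hypothetical REST endpoint; the host name, bucket, and object key are placeholders, and real object storage services also require authentication (for example, signed request headers), which is omitted here.

import urllib.request

# Hypothetical object storage endpoint and object URL.
OBJECT_URL = "http://objstore.example.com/my-bucket/photos/cat.jpg"

def put_object(url: str, data: bytes) -> int:
    # Upload the object with an HTTP PUT to its URL.
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Content-Type", "application/octet-stream")
    with urllib.request.urlopen(req) as resp:
        return resp.status

def get_object(url: str) -> bytes:
    # Download the object with a plain HTTP GET on the same URL.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

print("PUT status:", put_object(OBJECT_URL, b"...jpeg bytes..."))
print("object size:", len(get_object(OBJECT_URL)))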
As the system grows, object storage can still provide a single namespace. This way,
applications or users do not need to worry about which storage system they are using. By
using object storage, you do not need to manage multiple storage volumes like using a
file system. This greatly reduces O&M workloads.
Object storage has many advantages in processing unstructured data over traditional
storage and delivers the advantages of both SAN and NAS. It is independent of platforms
or locations, offering scalability, security, and data sharing:
It can distribute object requests to large-scale storage cluster servers. This enables an
inexpensive, reliable, and scalable storage system for massive amounts of data. Other
advantages of object storage are as follows:
⚫ Security: data consistency and content authenticity. Object storage uses special
algorithms to generate objects with strong encryption. Requests in object storage are
verified in storage devices instead of using external verification mechanisms.
⚫ Platform-independent design: Objects are abstract containers for data (including
metadata and attributes). This allows objects to be shared between heterogeneous
platforms, either locally or remotely, making object storage the best choice in cloud
computing.
⚫ Scalability: The flat address space enables object storage to store a large amount of data without compromising performance. Both storage nodes and object storage device (OSD) nodes can scale independently in terms of performance and capacity.
The OSD intelligently manages and protects objects. Its protection and replication capabilities are self-healing, enabling data redundancy at a low cost. If one or more nodes in a distributed object storage system fail, data can still be accessed. In such cases, three data nodes can concurrently transfer data, making the transfer fast. As the number of data node servers increases, read and write speeds increase accordingly. In this way, performance is improved.
3.1.5.3.4 Summary of Block Storage, File Storage, and Object Storage
In block storage, file systems reside on top of application servers, and applications
directly access blocks. The FC protocol is usually used for data transfer, and it has a
higher transmission efficiency than the TCP/IP protocol used in file storage. The header of
each protocol data unit (PDU) in TCP/IP is about twice as large as the header of an FC data frame. In addition, the maximum length of an FC data frame is larger than that of an Ethernet frame. But data cannot be shared between hosts in block storage. Some enterprise
workloads may require data or file sharing between different types of clients, and block
storage cannot do this. In addition, block storage is complex and costly because
additional components, such as FC components and HBAs, need to be purchased.
File systems are deployed on file storage devices, and users access specific files, for
example, opening, reading from, writing to, or closing a file. File storage maps file
operations to disk operations, and users do not need to know the exact disk block where
the file resides. Data is exchanged between users and file storage over the Ethernet in a
LAN. File storage is easy to manage and supports comprehensive information access. One
can share files by simply connecting the file storage devices to a LAN. This makes file
sharing and collaboration more efficient. But file storage is not suitable for applications
that demand block devices, especially database systems. This is because file storage
requires the conversion of file system format and users access specific files instead of
data.
Object storage uses a content addressing system to simplify storage management,
ensuring that the stored content is unique. It offers terabyte to petabyte scalability for
static data. When a data object is stored, the system converts the binary content of the
stored data to a unique identifier. The content address is not a simple mapping of the
directory, file name, or data type of the stored data. Object storage ensures content reliability with
globally unique, location-independent identifiers and high scalability. It is good at storing
non-transactional data, especially static data and is applicable to archives, backups,
massive file sharing, scientific and research data, and digital media.

3.2 Key Storage Technologies


3.2.1 RAID Technology
3.2.1.1 What Is RAID
Redundant Array of Independent Disks (RAID) combines multiple physical disks into one
logical disk in different ways, improving read/write performance and data security. With
the development of RAID technology, RAID can be divided into seven basic levels (RAID 0
to RAID 6). In addition, there are some combinations of basic RAID levels, such as RAID
10 (combination of RAID 1 with RAID 0) and RAID 50 (combination of RAID 5 with RAID
0). Different RAID levels represent different storage performance, data security, and
storage costs.

3.2.1.2 RAID Data Organization Forms



Figure 3-14 Data organization forms of RAID


RAID divides space in each disk into multiple strips of a specific size. Written data is also
divided into blocks based on the strip size. The following concepts are involved:
Strip: A strip consists of one or more consecutive sectors in a disk, and multiple strips
form a stripe.
Stripe: A stripe consists of strips of the same location or ID on multiple disks in the same
array.
Stripe width indicates the number of disks used in an array for striping. For example, if a
disk array consists of three member disks, the stripe width is 3.
Stripe depth indicates the capacity of a strip.
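These concepts can be made concrete with a small sketch. Assuming a stripe width of 3 disks and a stripe depth (strip size) of 64 KiB, both illustrative values, the following Python snippet shows which disk and which strip a given logical byte offset falls into.

# Map a logical byte offset to (stripe, disk, offset inside the strip).
# Stripe width and depth below are illustrative assumptions.
STRIPE_WIDTH = 3              # number of member disks used for striping
STRIPE_DEPTH = 64 * 1024      # strip size (capacity of one strip), in bytes

def locate(offset: int) -> tuple[int, int, int]:
    strip_index = offset // STRIPE_DEPTH       # which strip overall
    stripe = strip_index // STRIPE_WIDTH       # strips with the same ID form a stripe
    disk = strip_index % STRIPE_WIDTH          # round-robin placement across disks
    return stripe, disk, offset % STRIPE_DEPTH

for off in (0, 70_000, 200_000, 400_000):
    stripe, disk, inner = locate(off)
    print(f"offset {off:>7}: stripe {stripe}, disk {disk}, strip offset {inner}")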

3.2.1.3 RAID Data Protection Techniques


RAID generally protects data by the following methods:
⚫ Mirroring: Data copies are stored on another redundant disk, improving reliability
and read performance.
⚫ Parity check algorithm (XOR): Parity data is additional information calculated using
user data. For a RAID array that uses parity, an additional parity disk is required. The
XOR (symbol: ⊕) algorithm is used for parity.
XOR is widely used in digital electronics and computer science. XOR is a logical operation
that outputs true only when inputs differ (one is true, the other is false).
⚫ 0 ⊕ 0 = 0, 0 ⊕ 1 = 1, 1 ⊕ 0 = 1, 1 ⊕ 1 = 0

Figure 3-15 XOR check
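The sketch below shows how XOR parity works in practice with three data strips and one parity strip (the strip contents are arbitrary example bytes): the parity is the XOR of the data strips, and any single lost strip can be rebuilt by XOR-ing the parity with the surviving strips.

# XOR parity: parity = d1 ^ d2 ^ d3, and any one lost strip can be recovered.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1 = b"STRIP-1!"
d2 = b"STRIP-2!"
d3 = b"STRIP-3!"

# Compute the parity strip over the whole stripe.
parity = xor_bytes(xor_bytes(d1, d2), d3)

# Simulate losing the disk holding d2 and rebuild it from the others.
rebuilt_d2 = xor_bytes(xor_bytes(d1, d3), parity)
assert rebuilt_d2 == d2
print("recovered:", rebuilt_d2)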

3.2.1.4 RAID Hot Spare and Reconstruction


If a disk in a RAID array fails, a hot spare is used to automatically replace the failed disk
to maintain the RAID array's redundancy and data continuity.
Hot spare is classified into the following types:
⚫ Global: The spare disk is shared by all RAID groups in the system.
⚫ Dedicated: The spare disk is used only by a specific RAID group in the system.

Figure 3-16 Hot spare and reconstruction


Data reconstruction: indicates a process of reconstructing data from a failed data disk to
the hot spare disk. Generally, the data parity mechanism in RAID is used to reconstruct
data.
Data parity: Redundant data is used to detect and rectify data errors. The redundant data
is usually calculated through Hamming check or XOR operations. Data parity can greatly
improve the reliability, performance, and error tolerance of the drive arrays. However, the
system needs to read data from multiple locations, calculate, and compare data during
the parity process, which affects system performance.
Generally, RAID cannot be used as an alternative to data backup. It cannot prevent data
loss caused by non-drive faults, such as viruses, man-made damages, and accidental
deletion. Data loss here refers to the loss of operating system, file system, volume
manager, or application system data, not the RAID data loss. Therefore, data protection
measures, such as data backup and disaster recovery, are necessary. They are
complementary to RAID, and can ensure data security and prevent data loss at different
layers.

3.2.1.5 Common RAID Levels


3.2.1.5.1 RAID 0

Figure 3-17 RAID 0 diagram


RAID 0 is a simple data striping technology without parity. In essence, RAID 0 is not a
real RAID, because it offers no redundancy. In RAID 0, disks are striped to form a large-
capacity storage space (as shown in Figure 3-17), data is distributed across all disks, and
reading data from multiple disks can be processed concurrently. RAID 0 allows I/O
operations to be performed concurrently, improving utilization of the bus bandwidth. In
addition, RAID 0 requires no data parity, thereby providing the highest performance. If a
RAID 0 group consists of n disks, theoretically, the read and write performance of the
group is n times that of a single disk. Due to the bus bandwidth restriction and other
factors, the actual performance is lower than the theoretical one.
RAID 0 features low cost, high read/write performance, and 100% disk usage. However, it
offers no redundancy. In the event of a disk failure, data is lost. Therefore, RAID 0 is
applicable to applications that have high requirements on performance but low
requirements on data security and reliability, such as video/audio storage and temporary
storage space.
3.2.1.5.2 RAID 1
RAID 1, also known as mirror or mirroring, is designed to maximize the availability and
repairability of user data. RAID 1 automatically copies all data written to one disk to the
other disk in a RAID group.
RAID 1 writes the same data to the mirror disk while storing the data on the source disk.
If the source disk fails, the mirror disk takes over services from the source disk. RAID 1
delivers the best data security among all RAID levels because the mirror disk is used for
data backup. However, no matter how many disks are used, the available storage space
is only the capacity of a single disk. Therefore, RAID 1 delivers the lowest disk usage
among all RAID levels.

Figure 3-18 RAID 1 diagram


Figure 3-18 shows the diagram of RAID 1. There are two disks, Disk 1 and Disk 2. RAID 1
stores the data (D1, D2...) in the primary disk (Disk 1), and then stores the data again in
Disk 2 for data backup.
RAID 1 is the highest in unit storage cost among all RAID levels. However, it delivers the
highest data security and availability. RAID 1 is applicable to online transaction
processing (OLTP) applications with intensive read operations and other applications that
require high read/write performance and reliability, for example, email, operating system,
application file, and random access environment.

3.2.1.5.3 RAID 3

Figure 3-19 RAID 3 diagram


RAID 3 is a parallel access array that uses one disk as the parity disk and other disks as
data disks. Data is stored to each data disk by bit or byte. RAID 3 requires at least three
disks. XOR check is performed for data in the same stripe on different disks, and the
parity data is written into the parity disk. The read performance of a complete RAID 3
group is the same as that of a RAID 0 group. Data is concurrently read from multiple disk
strips, providing high performance and data fault tolerance. With RAID 3, when data is written, the system must read all data blocks in the same stripe to calculate a check
value and write the new value to the parity disk. The write operation involves four
operations: writing a data block, reading data blocks in the same stripe, calculating a
check value, and writing the check value. As a result, the system overhead is high and the
performance decreases.
If a disk in RAID 3 is faulty, data reading is not affected. The system reconstructs the data
based on the parity data and other intact data. If the data block to be read is located on
the faulty disk, the system reads all data blocks in the same stripe and reconstructs the
lost data based on the parity value. As a result, the system performance decreases. After
the faulty disk is replaced, the system reconstructs the data on the faulty disk to the new
disk in the same way.
RAID 3 requires only one parity disk. The disk usage is high. In addition, concurrent access
delivers high performance for a large number of read and write operations with high
bandwidth. RAID 3 applies to applications that require sequential access to large
amounts of data, such as image processing and streaming media services. Currently, the
RAID 5 algorithm is continuously improved to simulate RAID 3 when a large amount of
data is read. In addition, the performance of RAID 3 deteriorates greatly when a disk is
faulty. Therefore, RAID 5 is often used instead of RAID 3 to run applications that feature continuous, high-bandwidth access with a large number of read and write operations.

3.2.1.5.4 RAID 5

Figure 3-20 RAID 5 diagram


RAID 5 is a compromise between RAID 0 and RAID 1. RAID 5 offers slower write speeds
due to the parity check information but could offer the same read performance as RAID
0. In addition, RAID 5 offers higher disk usage and lower storage costs than RAID 1
because multiple data records of RAID 5 share the same parity check information. It is
widely used at present.
In a RAID 5 group, data and associated parity check information are stored on the
member disks. To be specific, the capacity of N - 1 disks is used to store the data, and the
capacity of one disk is used to store the parity check information ( N indicates the number
of disks). Therefore, if a disk in RAID 5 is damaged, data integrity is not affected,
ensuring data security. After a damaged disk is replaced, RAID 5 automatically
reconstructs data on the faulty disk based on the parity check information, ensuring high
reliability.
The available capacities of all the disks in a RAID 5 group must be the same. If not, the
available capacity depends on the smallest one. It is recommended that the rotational
speeds of the disks be the same. Otherwise, the performance is affected. In addition, the
available space is equal to the space of N – 1 disks. RAID 5 has no independent parity
disk, so the parity information is distributed across all disks, occupying the capacity of
one disk.
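The capacity rule can be checked with a quick calculation. The following sketch compares usable capacity and disk usage for RAID 0, RAID 1, RAID 5, and RAID 6 groups built from N identical disks; the disk count and size are example values only.

# Usable capacity for common RAID levels, given N identical disks.
def usable_capacity(level: str, n_disks: int, disk_tb: float) -> float:
    if level == "RAID 0":
        return n_disks * disk_tb              # striping only, no redundancy
    if level == "RAID 1":
        return disk_tb                        # mirrored: capacity of one disk
    if level == "RAID 5":
        return (n_disks - 1) * disk_tb        # one disk's worth of parity
    if level == "RAID 6":
        return (n_disks - 2) * disk_tb        # two disks' worth of parity
    raise ValueError(level)

n, size_tb = 6, 4.0                           # example: six 4 TB disks
for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 6"):
    cap = usable_capacity(level, n, size_tb)
    print(f"{level}: {cap:.0f} TB usable of {n * size_tb:.0f} TB ({cap / (n * size_tb):.0%})")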
In RAID 5, disks store both data and parity data. Data blocks and associated check
information are stored on different disks. If one data disk is faulty, the system
reconstructs data on the faulty disk based on data blocks and associated check
information on other disks in the same strip. Like other RAID levels, the performance of
RAID 5 is greatly affected during data reconstruction.
RAID 5 is a storage protection solution that balances storage performance, data security,
and storage cost. It can be considered as a compromise between RAID 0 and RAID 1.
RAID 5 can meet most storage application requirements. Most data centers adopt RAID 5
as the protection solution for application data.

3.2.1.5.5 RAID 6
RAID 6 breaks through the single-disk-failure redundancy limitation of the RAID levels described above.

Figure 3-21 RAID 6 DP diagram


In the past, the probability of two disks failing at the same time was low. However, as the capacity and density of FC and SATA disks increase, RAID 5 reconstruction takes longer, and the risk of two disks failing at the same time increases greatly. Enterprise-level storage must attach great importance to this risk. Therefore, RAID 6 is introduced.
The RAID levels described in the previous sections only protect against data loss caused by the failure of a single disk. If two disks are faulty at the same time, data cannot be restored.
As shown in Figure 3-21, RAID 6 adopts double parity to prevent data loss in the event of
simultaneous failure of two disks, ensuring service continuity. RAID 6 is designed based
on RAID 5 to further enhance the data security. It is actually an extended RAID 5 level.
RAID 6 must support the recovery of both actual data and parity data and the RAID
controller design is more complicated. As a result, RAID 6 is more expensive than other
RAID levels. In most cases, RAID 6 can be implemented by using two independent parity
columns. Parity data can be stored on two different parity disks or distributed across all
member disks. If two disks fail at the same time, the data on the two disks can be
reconstructed by solving the equation with two unknowns.
Alternatively, RAID 6 can be implemented by using double parity (DP).
⚫ RAID 6 DP also has two independent parity data blocks. Parity values in the
horizontal parity disk are also called parity check values, which are obtained by
performing the XOR operation on user data in the same stripe. As shown in Figure 3-
21, P0 is obtained by performing an XOR operation on D0, D1, D2, and D3 in stripe 0,
and P1 is obtained by performing an XOR operation on D4, D5, D6, and D7 in stripe
1. Therefore, P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3, P1 = D4 ⊕ D5 ⊕ D6 ⊕ D7, and so on.
⚫ The diagonal parity uses a diagonal XOR operation to obtain the row-diagonal parity data blocks. The process of selecting data blocks is relatively complex. DP0 is obtained by performing an XOR operation on D0 in stripe 0 of disk 1, D5 in stripe 1 of disk 2, D10 in stripe 2 of disk 3, and D15 in stripe 3 of disk 4. DP1 is obtained by performing an XOR operation on D1 in stripe 0 of disk 2, D6 in stripe 1 of disk 3, D11 in stripe 2 of disk 4, and P3 in stripe 3 of the horizontal parity disk. DP2 is obtained by performing an XOR operation on D2 in stripe 0 of disk 3, D7 in stripe 1 of disk 4, P2 in stripe 2 of the horizontal parity disk, and D12 in stripe 3 of disk 1. Therefore, DP0 = D0 ⊕ D5 ⊕ D10 ⊕ D15, DP1 = D1 ⊕ D6 ⊕ D11 ⊕ P3, and so on.
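The horizontal and diagonal parity formulas above can be reproduced with a short script. The data values below are arbitrary; the script computes P0 to P3 and DP0 to DP2 exactly as described, and then shows that a block such as D0 can be rebuilt from either parity set.

# RAID 6 DP parity example following the formulas in the text.
# D0..D15 are arbitrary example values (one byte per block for simplicity).
D = list(range(16))          # D0..D15 laid out as 4 stripes x 4 data disks

# Horizontal (row) parity per stripe: P0 = D0^D1^D2^D3, P1 = D4^D5^D6^D7, ...
P = [D[4*s] ^ D[4*s + 1] ^ D[4*s + 2] ^ D[4*s + 3] for s in range(4)]

# Diagonal parity blocks as described above.
DP0 = D[0] ^ D[5] ^ D[10] ^ D[15]
DP1 = D[1] ^ D[6] ^ D[11] ^ P[3]
DP2 = D[2] ^ D[7] ^ P[2] ^ D[12]

# D0 can be rebuilt from the row parity of stripe 0 ...
assert D[0] == P[0] ^ D[1] ^ D[2] ^ D[3]
# ... or, independently, from the diagonal parity DP0.
assert D[0] == DP0 ^ D[5] ^ D[10] ^ D[15]
print("P:", P, "DP0..DP2:", DP0, DP1, DP2)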

RAID 6 features fast read performance and high fault tolerance. However, the cost of
RAID 6 is much higher than that of RAID 5, the write performance is poor, and the design
and implementation are complicated. Therefore, RAID 6 is seldom used and is mainly
applicable to scenarios that require high data security. It can be used as an economical
alternative to RAID 10.

3.2.1.6 Introduction to RAID 2.0


⚫ RAID 2.0
RAID 2.0 is an enhanced RAID technology that effectively resolves the following
problems: prolonged reconstruction of an HDD, and data loss if a disk is faulty during the
long reconstruction of a traditional RAID group.
⚫ RAID 2.0+
RAID 2.0+ provides smaller resource granularities (tens of KB) than RAID 2.0 to serve as
the units of standard allocation and reclamation of storage resources, similar to VMs in
computing virtualization. This technology is called virtual block technology.
⚫ Huawei RAID 2.0+
Huawei RAID 2.0+ is a brand-new RAID technology developed by Huawei to overcome
the disadvantages of traditional RAID and keep in line with the storage architecture
virtualization trend. RAID 2.0+ implements two-layer virtualized management instead of
the traditional fixed management. Based on the underlying disk management that
employs block virtualization (Virtual for Disk), RAID 2.0+ uses Smart-series efficiency
improvement software to implement efficient resource management that features upper-
layer virtualization (Virtual for Pool). Block virtualization divides disks into multiple contiguous, fixed-size storage spaces called chunks (CKs).

3.2.1.7 RAID 2.0+ Block Virtualization

Figure 3-22 Working principles of RAID 2.0+ block virtualization


⚫ The working principles of RAID 2.0+ block virtualization are as follows:
1. Multiple SSDs form a storage pool.
2. Each SSD is then divided into CKs of a fixed size (typically 4 MB) for logical space
management.
3. CKs from different SSDs form chunk groups (CKGs) based on the RAID policy
specified on DeviceManager.
4. CKGs are further divided into grains (typically 8 KB). Grains are mapped to LUNs for
refined management of storage resources.
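A simplified model of this layering is sketched below. The chunk and grain sizes follow the typical values mentioned above, but the disk sizes, RAID policy, and data structures are illustrative only, not Huawei's actual implementation.

# Simplified model of RAID 2.0+ block virtualization:
# disks -> fixed-size chunks (CKs) -> chunk groups (CKGs) -> grains -> LUN.
# Sizes and the simple CKG selection are illustrative assumptions.
CK_SIZE = 4 * 1024 * 1024          # typical chunk size: 4 MB
GRAIN_SIZE = 8 * 1024              # typical grain size: 8 KB
CKG_WIDTH = 3                      # e.g. an assumed RAID policy of 2 data + 1 parity CKs

def carve_chunks(disk_id: int, disk_bytes: int) -> list[tuple[int, int]]:
    """Divide one disk into CKs, each identified as (disk_id, chunk_index)."""
    return [(disk_id, i) for i in range(disk_bytes // CK_SIZE)]

# A small pool of disks, each carved into chunks.
pool = [carve_chunks(d, 64 * CK_SIZE) for d in range(5)]

# Form one CKG by taking one CK from each of CKG_WIDTH different disks.
ckg = [pool[d][0] for d in range(CKG_WIDTH)]

# Divide the CKG space into grains (parity overhead ignored in this toy model);
# grains are then mapped to LUNs for fine-grained resource management.
grains_per_ckg = (CKG_WIDTH * CK_SIZE) // GRAIN_SIZE
print("CKG members:", ckg)
print("grains in this CKG:", grains_per_ckg)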
⚫ RAID 2.0+ outperforms traditional RAID in the following aspects:
- Service load balancing to avoid hot spots: Data is evenly distributed to all disks in
the resource pool, protecting disks from early end of service lives due to excessive
writes.
- Fast reconstruction to reduce risk window: When a disk fails, the valid data in the
faulty disk is reconstructed to all other functioning disks in the resource pool
(fast many-to-many reconstruction), efficiently resuming redundancy protection.
- Reconstruction load balancing among all disks: All member disks in a storage
resource pool participate in reconstruction, and each disk only needs to
reconstruct a small amount of data. Therefore, the reconstruction process does
not affect upper-layer applications.

3.2.2 Storage Protocol


3.2.2.1 SCSI Protocol

Figure 3-23 SCSI protocol


Computers communicate with storage systems through buses. The bus is a path through
which data is transferred from the source device to the target device. To put it simply,
the high-speed cache of the controller functions as the source device and transfers data
to target disks, which serve as the target devices. The controller sends a signal to the bus
processor requesting to use the bus. After the request is accepted, the controller's high-
speed cache sends data. During this process, the bus is occupied by the controller and
other devices connected to the same bus cannot use it. However, the bus processor can
interrupt the data transfer at any time and allow other devices to use the bus for
operations of a higher priority.
A computer has numerous buses, which are like high-speed channels used for
transferring information and power from one place to another. For example, the
universal serial bus (USB) port is used to connect an MP3 player or digital camera to a
computer. The USB port is adequate for the data transfer and charging of portable electronic devices that store pictures and music. However, the USB bus cannot meet the demands of computers, servers, and many other devices.
In this case, SCSI buses are applicable. SCSI, short for Small Computer System Interface, is
an interface used to connect between hosts and peripheral devices including disk drives,
tape drives, CD-ROM drives, and scanners. Data operations are implemented by SCSI
controllers. Like a small CPU, the SCSI controller has its own command set and cache.
The special SCSI bus architecture can dynamically allocate resources to tasks run by
multiple devices in a computer. In this way, multiple tasks can be processed at the same
time.
SCSI is a vast protocol system evolved from SCSI-1 to SCSI-2 and then to SCSI-3. It
defines a model and a necessary command set for different devices (such as disks,
processors, and network devices) to exchange information using the framework.

3.2.2.2 iSCSI Protocol


iSCSI encapsulates SCSI commands and block data into TCP packets and transmits the
packets over an IP network. iSCSI uses mature IP network technologies to implement and
extend SANs.

Figure 3-24 iSCSI protocol


A SCSI controller card can connect multiple devices to form a network, but those devices can only communicate with one another on that network and cannot be shared over the Ethernet. If devices networked through SCSI could be attached to an Ethernet, they could interconnect and be shared with other devices as network nodes. With this idea, the iSCSI protocol evolved from SCSI. The IP SAN using iSCSI converts user
requests into SCSI codes and encapsulates data into IP packets for transmission over the
Ethernet.
The iSCSI scheme was initiated by Cisco and IBM and then advocated by Adaptec, Cisco,
HP, IBM, Quantum, and other companies. iSCSI offers a way of transferring data through
TCP and saving data on SCSI devices. The iSCSI standard was drafted in 2001 and
submitted to IETF in 2002 after numerous arguments and modifications. In Feb. 2003, the
iSCSI standard was officially released. The iSCSI technology inherits advantages of
traditional technologies and develops based on them. On one hand, SCSI technology is a
storage standard widely applied by storage devices including disks and tapes. It has been
keeping a fast development pace since 1986. On the other hand, TCP/IP is the most
universal network protocol and IP network infrastructure is mature. The two points
provide a solid foundation for iSCSI development.
Prevalent IP networks allow data to be transferred over LANs, WANs, or the Internet
using new IP storage protocols. The iSCSI protocol is developed by this philosophy. iSCSI
adopts IP technical standards and converges SCSI and TCP/IP protocols. Ethernet users
can conveniently transfer and manage data with a small investment.
3.2.2.2.1 iSCSI Initiator and Target

Figure 3-25 iSCSI initiator and target


The iSCSI communication system inherits some of SCSI's features. The iSCSI
communication involves an initiator that sends I/O requests and a target that responds to
the I/O requests and executes I/O operations. After a connection is set up between the
initiator and target, the target controls the entire process as the primary device.
⚫ There are three types of iSCSI initiators: software-based initiator driver, hardware-
based TCP offload engine (TOE) NIC, and iSCSI HBA. Their performance increases in
that order.
⚫ iSCSI targets include iSCSI disk arrays and iSCSI tape libraries.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators
and targets. All iSCSI nodes are identified by their iSCSI names. This method distinguishes
iSCSI names from host names.
iSCSI uses iSCSI names to identify initiators and targets. Addresses change with the
relocation of initiator or target devices, but their names remain unchanged. When setting
up a connection, an initiator sends a request. After the target receives the request, it
checks whether the iSCSI name contained in the request is consistent with that bound
with the target. If the iSCSI names are consistent, the connection is set up. Each iSCSI
node has a unique iSCSI name. One iSCSI name can be used in the connections from one
initiator to multiple targets. Multiple iSCSI names can be used in the connections from
one target to multiple initiators.
The functions of the iSCSI initiator and target are as follows:
⚫ Initiator
The SCSI layer generates command descriptor blocks (CDBs) and transfers them to the
iSCSI layer.
The iSCSI layer generates iSCSI protocol data units (PDUs) and sends them to the target
over an IP network.
⚫ Target
The iSCSI layer receives PDUs and sends CDBs to the SCSI layer.
The SCSI layer interprets CDBs and gives responses when necessary.
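The division of labor between the two sides can be sketched as a toy request/response exchange. The structures below are a simplified illustration of the layering (a SCSI CDB carried inside an iSCSI PDU), not the real iSCSI wire format; the iSCSI names are hypothetical examples.

from dataclasses import dataclass

# Toy illustration of the initiator/target layering.
@dataclass
class IscsiPdu:
    initiator_name: str      # iSCSI name identifying the initiator
    target_name: str         # iSCSI name identifying the target
    cdb: bytes               # SCSI command descriptor block

def initiator_send(read_cdb: bytes) -> IscsiPdu:
    # The SCSI layer generated the CDB; the iSCSI layer wraps it into a PDU
    # that is carried to the target over TCP/IP.
    return IscsiPdu("iqn.2025-01.com.example:host1",
                    "iqn.2025-01.com.example:array1", read_cdb)

def target_handle(pdu: IscsiPdu) -> bytes:
    # The target's iSCSI layer unwraps the PDU and hands the CDB to its
    # SCSI layer, which executes the I/O and returns data or status.
    assert pdu.cdb[0] == 0x28            # 0x28 is the SCSI READ(10) opcode
    return b"block data read from the LUN"

pdu = initiator_send(bytes([0x28]) + bytes(9))   # minimal READ(10)-style CDB
print(target_handle(pdu))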

3.2.2.3 Convergence of Fibre Channel and TCP


Ethernet technologies and Fibre Channel technologies are both developing fast. Therefore, it is inevitable that IP SAN and FC SAN, which complement each other, will coexist for a long time.
The following protocols use a TCP/IP network to carry FC channels:
⚫ Internet Fibre Channel Protocol (iFCP) is a gateway-to-gateway protocol that
provides Fibre Channel communication services for optical devices on TCP/IP
networks. iFCP delivers congestion control, error detection, and recovery functions
through TCP. The purpose of iFCP is to enable current Fibre Channel devices to
interconnect and network at the line rate over an IP network. The frame address
conversion method defined in this protocol allows Fibre Channel storage devices to
be added to the IP-based network through transparent gateways.
⚫ Fibre Channel over Ethernet (FCoE) transmits Fibre Channel signals over an Ethernet,
so that Fibre Channel data can be transmitted at the backbone layer of a 10 Gbit/s
Ethernet using the Fibre Channel protocol.
3.2.2.3.1 iFCP Protocol

Figure 3-26 iFCP protocol


iFCP is a gateway-to-gateway protocol that provides Fibre Channel communication services for Fibre Channel devices on a TCP/IP network to implement end-to-end IP
connection. Fibre Channel storage devices, HBAs, and switches can directly connect to
iFCP gateways. iFCP provides traffic control, error detection, and error recovery through
TCP. It enables Fibre Channel devices to interconnect and network at the line rate over an
IP network.
The frame address conversion method defined in the iFCP protocol allows Fibre Channel
storage devices to be added to the TCP/IP-based network through transparent gateways.
iFCP devices can replace Fibre Channel to connect and group Fibre Channel devices. However, iFCP does not support merging independent SANs into one logical SAN. iFCP stands out in supporting SAN interconnection as well as gateway zoning, allowing fault isolation and breaking the limitations of point-to-point tunnels. In addition, it enables end-to-end connections between Fibre Channel devices. As a result, the interruption of a TCP connection affects only one communication pair. SANs that adopt iFCP support fault isolation and security management, and deliver higher reliability than SANs that adopt FCIP.
3.2.2.3.2 iFCP Protocol Stack

Figure 3-27 iFCP protocol stack


Fibre Channel only allows data to be transmitted locally, while iFCP enables data to be
transmitted over an IP network and remotely transmitted across WANs through routers
by encapsulating IP headers. In this way, enterprise users can use the existing storage
devices and network architecture to share storage resources with more applications,
breaking the geographical limitations of the traditional DAS and SAN architectures
without changing the existing storage protocols.
3.2.2.3.3 FCoE Protocol
Fibre Channel over Ethernet (FCoE) allows the transmission of LAN and FC SAN data on
the same Ethernet link. This reduces the number of devices, cables, and network nodes in
a data center, as well as power consumption and cooling loads, simplifying management.
FCoE encapsulates FC data frames in Ethernet frames and allows service traffic on a LAN
and SAN to be transmitted over the same Ethernet.

From the perspective of Fibre Channel, FCoE enables Fibre Channel to be carried by the
Ethernet Layer 2 link. From the perspective of the Ethernet, FCoE is an upper-layer
protocol that the Ethernet carries, like IP or IPX.
3.2.2.3.4 FCoE Protocol Encapsulation

Figure 3-28 FCoE protocol encapsulation


The Fibre Channel protocol stack has five layers. FC-0 defines the medium type, FC-1
defines the frame coding and decoding mode, FC-2 defines the frame division protocol
and flow control mechanism, FC-3 defines general services, and FC-4 defines the
mapping from upper-layer protocols to Fibre Channel. FCoE encapsulates contents in the
FC-2 and above layers into Ethernet packets for transmission.

3.3 Quiz
What are the relationships between DAS, NAS, SAN, block storage, file storage, and
object storage?
