ZFS Tech Talk
https://en.wikipedia.org/wiki/ZFS http://open-zfs.org/wiki/History
Table of Contents
https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/
ZFS Storage: Data in Memory
Traditional Caches
• With the LRU algorithm, when an application reads data blocks, they are put into
the cache, and the cache fills as more and more data is read.
• However, the cache behaves like a FIFO (first in, first out) queue: when the cache
is full, the older pages are pushed out, even if those older pages are accessed more
frequently. Think of the whole process as a conveyor belt. Blocks are put into the
most recently used end of the cache.
• As more blocks are read, they push the older blocks toward the least recently
used end of the cache, until they fall off the conveyor belt, or in other words
are evicted.
• When large sequential reads come off disk and are placed into the cache, they
tend to evict more frequently requested pages, even if the sequential data was only
needed once. Thus, from the cache's perspective, it ends up holding a lot of
worthless data that is no longer needed. Of course, it's eventually replaced as
newer data blocks are requested.
• Least frequently used (LFU) caches also exist. However, they suffer from the
opposite problem: newer data can be evicted from the cache if it's not read
frequently enough. That causes a great number of disk requests, which rather
defeats the purpose of a cache to begin with. So the obvious approach would be to
somehow combine the two: run an LRU and an LFU simultaneously.
The ZFS ARC
• The ZFS adjustable replacement cache (ARC) is one such caching
mechanism: it caches both recently requested blocks and frequently
requested blocks. It is an implementation of the patented IBM adaptive
replacement cache, with some modifications and extensions.
• Adjustable Replacement Cache, or ARC - a cache residing in physical
RAM. It is built from two caches: the most frequently used (MFU)
cache and the most recently used (MRU) cache. A cache directory
indexes pointers to the caches, including pointers to blocks on disk,
called the ghost MFU cache and the ghost MRU cache.
• Cache Directory - an indexed directory of pointers making up the MRU, MFU,
ghost MRU, and ghost MFU caches.
• MRU Cache - the most recently used cache of the ARC. The most recently
requested blocks from the filesystem are cached here.
• MFU Cache - the most frequently used cache of the ARC. The most frequently
requested blocks from the filesystem are cached here.
• Ghost MRU - pages evicted from the MRU cache back to disk to save space in
the MRU. Pointers still track the location of the evicted pages on disk.
• Ghost MFU - pages evicted from the MFU cache back to disk to save space in
the MFU. Pointers still track the location of the evicted pages on disk.
• Level 2 Adjustable Replacement Cache, or L2ARC - a cache residing outside of
physical memory, typically on a fast SSD. It is a literal, physical extension of the
RAM ARC.
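You can watch the relative sizes of these lists at runtime. A minimal sketch, assuming a Linux host running OpenZFS (the statistic names come from the kernel's arcstats and can differ between platforms and versions):

# awk '$1 ~ /^(size|mru_size|mru_ghost_size|mfu_size|mfu_ghost_size)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats

On illumos/Solaris the equivalent counters are exposed through kstat, e.g. # kstat -p zfs:0:arcstats.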
The ARC Algorithm
• This is a simplified version of how the IBM ARC works, but it should help you
understand how priority is placed on both the MRU and the MFU. First, let's
assume that you have eight pages in your cache: four pages will be used for the
MRU and four pages for the MFU. Further, there will be four pointers for the
ghost MRU and four pointers for the ghost MFU. As such, the cache directory
will reference 16 pages of live or evicted cache.
1. As would be expected, when block A is read from the filesystem, it will be cached
in the MRU. An index pointer in the cache directory will reference the MRU
page.
2. Suppose now a different block (block B) is read from the filesystem. It too will be
cached in the MRU, and an index pointer in the cache directory will reference the
second MRU page. Because block B was read more recently than block A, it gets
higher preference in the MRU cache than block A. There are now two pages in the
MRU cache.
3. Now suppose block A is read again from the filesystem. That makes two reads
for block A. Since it has now been read frequently, it will be stored in the MFU; a
block must be read at least twice to be stored here. Further, it is also a recent
request, so not only is the block cached in the MFU, it is also referenced in the
MRU of the cache directory. As a result, although two pages reside in the cache,
there are three pointers in the cache directory pointing to two blocks in the cache.
4. Eventually, the cache is filled with the above steps, and we have pointers in the
MRU and the MFU of the cache directory.
5. Here's where things get interesting. Suppose we now need to read a new block
from the filesystem that is not cached. By the pigeonhole principle, we have more
pages to cache than we can store, so we need to evict a page from the cache. The
oldest page in the MRU (the least recently used, or LRU, page) gets the eviction
notice, and is now referenced by the ghost MRU. A page is then free in the MRU
for the newly read block.
6. The newly read block is then stored in the MRU, as expected, and referenced
accordingly. Thus, we have a ghost MRU page reference and a filled cache.
7. Just to throw a monkey wrench into the whole process, let us suppose that the
recently evicted page is re-read from the filesystem. Because the ghost MRU
knows it was recently evicted from the cache, we refer to this as a "phantom cache
hit". Because ZFS knows the page was recently cached, it must be brought back
into the MRU cache, not the MFU cache, since it was not referenced by the ghost
MFU.
8. Unfortunately, our cache is too small to store the page, so we must grow the
MRU by one page to store the new phantom hit. The cache is only so large,
however, so we must shrink the MFU by one page to make space for the MRU.
The algorithm works in the same manner on the MFU and ghost MFU: phantom
hits for the ghost MFU enlarge the MFU, and shrink the MRU to make room for
the new page.
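The adaptation in step 8 is driven by ghost hit counters that you can watch directly. A minimal sketch, assuming an illumos/Solaris host with the kstat utility (on Linux the same names appear in /proc/spl/kstat/zfs/arcstats):

# kstat -p zfs:0:arcstats | egrep ghost_hits
# kstat -p zfs:0:arcstats:p

Here p is the adaptive target size of the MRU half of the ARC: rising mru_ghost_hits pushes p up, rising mfu_ghost_hits pushes it back down.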
The L2ARC
The level 2 ARC, or L2ARC, should be a fast disk. As mentioned in my previous
post about the ZIL, this should be DRAM DIMMs (not necessarily battery-backed),
a fast SSD, or 10k+ enterprise SAS or FC disk. If you decide to use the same
device for both your ZIL and your L2ARC, which is certainly acceptable, you
should partition it such that the ZIL takes up very little space, like 512 MB or 1 GB,
and give the rest to the pool as a striped (RAID-0) L2ARC. Persistence in the
L2ARC is not needed, as the cache is wiped on boot.
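A minimal sketch of that layout, assuming a pool named tank and a hypothetical SSD split into a small slice s0 for the ZIL (SLOG) and the remainder s1 for the L2ARC; log and cache are standard zpool vdev types, only the device names are made up:

# zpool add tank log c1t5d0s0
# zpool add tank cache c1t5d0s1
# zpool status tank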
Table of Contents
SAS HBA
Hardware Architecture for Deploying ZFS SDS
Hardware
https://www.broadcom.com/products/storage/host-bus-adapters/sas-9207-8i#downloads
https://www.broadcom.com/products/storage/host-bus-adapters/sas-9201-16e#downloads
https://www.broadcom.com/products/storage/host-bus-adapters/sas-9202-16e
Models: 9400-16i | 9400-8i | 9400-16e | 9400-8e
HBAs 9400-16e, 9405W-16e
HBAs with Tri-Mode SerDes, SAS3 12 Gb/s, PCIe 3.0
https://www.broadcom.com/products/storage/host-bus-adapters/sas-nvme-9405w-16e
https://netbergtw.com/products/aeon-j470-m3/
FC HBA
Fibre Channel host bus adapter, 4 Gb/s
NETWORK I/O
Dell Intel X520-DA2 10Gb SFP+ Dual Port Full Height Network Card - XYT17
Table of Contents
Disk Pool Architecture on ZFS Storage
RAID 5 computes parity before data is written to disk; data and parity are
distributed evenly across all of the drives. A minimum of 3 drives is needed to
create a RAID 5 array, and it tolerates the failure of 1 drive.
RAID 6 computes 2 parity blocks on every write, again spreading parity and data
evenly across the drives. It tolerates the failure of any 2 drives, making it safer
than RAID 5.
RAID 60 has the characteristics of RAID 6 plus the speed and scalability of
RAID 0. It tolerates the loss of 2 drives (per RAID 6 group).
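In ZFS terms the rough equivalents are raidz1 (single parity, RAID 5-like), raidz2 (double parity, RAID 6-like), and several raidz2 groups striped in one pool (RAID 60-like). A minimal sketch with hypothetical disk names:

RAID 5-like:
# zpool create tank raidz1 c1t1d0 c1t2d0 c1t3d0
RAID 6-like:
# zpool create tank raidz2 c1t1d0 c1t2d0 c1t3d0 c1t4d0
RAID 60-like (two raidz2 groups striped):
# zpool create tank raidz2 c1t1d0 c1t2d0 c1t3d0 c1t4d0 raidz2 c1t5d0 c1t6d0 c1t7d0 c1t8d0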
A disk pool allows flexible management of the underlying disk tiers; a single
system can hold several disk classes: SSD cache, SSD pool, FC/SAS pool, NL-SAS pool.
ZFS manages disks as a hybrid pool: a level 1 cache in RAM (the ARC), a level 2
write cache (the ZIL) on write-intensive SSDs, and a level 2 read cache (the
L2ARC) on read-intensive SSDs.
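A minimal sketch of such a hybrid pool, with hypothetical device names: mirrored data disks, a write-intensive SSD as the log (SLOG) device, and a read-intensive SSD as the L2ARC cache device:

# zpool create tank mirror c1t1d0 c1t2d0 mirror c1t3d0 c1t4d0
# zpool add tank log c2t0d0
# zpool add tank cache c2t1d0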
Table of Contents
Basic hardware requirements for installing the Oracle Solaris operating system (see the HCL):
Intel Xeon x86_64 CPU, from E5-2600 v1, v2, v3, v4 to Scalable…
At least 16 GB of RAM
SAS HBA matched to the SAS expander and SAS HDDs: SAS2, SAS3
Dual/quad-port FC HBA if you want to attach storage over FC (Emulex/QLogic)
Intel dual/quad-port 10 Gbps NIC for iSCSI connectivity
Backplane or disk enclosure supporting hot-swap / hot-plug SAS2, SAS3 drives
Drives whose firmware supports S.M.A.R.T. (Self-Monitoring, Analysis and
Reporting Technology; often written as SMART)
The OS already supports SAS2/SAS3 SSDs and NVMe PCIe SSDs
ZFS on Popular Operating Systems, A - Z
Install Oracle Solaris 11.3 x86_64
Configure the FC target with an Emulex card:
# vim +/target-mode /kernel/drv/emlxs.conf
Change
target-mode=0;
to
target-mode=1;

Configure the FC target with a QLogic card:
# vim +/qlc /etc/driver_aliases
Change
qlc "pciex1077,2432"
to
qlt "pciex1077,2432"
Config Oracle Solaris 11.3 x86_64
For systems using SSDs or drives that support Advanced Format (4Kn), check
the block sizes with:
# devprop -n /dev/rdsk/c1t0d0s0 device-blksize device-pblksize
Declare the drive parameters in:
# vim /kernel/drv/sd.conf (append at the end of the file)
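A minimal sketch of such an entry, with a hypothetical vendor/product string (the Oracle article linked below documents the exact format; the vendor ID field is padded to 8 characters):

sd-config-list = "ATA     SAMSUNG SSD", "physical-block-size:4096";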
https://docs.oracle.com/cd/E36784_01/html/E36834/gizbc.html
http://www.oracle.com/technetwork/articles/servers-storage-admin/solaris-advanced-format-disks-2344966.html
# update_drv -vf sd
# echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'
Tuning parameters for the system and ZFS are declared in:
# vim /etc/system
set zfs:zfs_default_bs=0xc >> ashift = 12 (4Kn, 2^12)
set zfs:zfs_write_limit_override=0x30000000 >> scale up with RAM
set zfs:zfs_write_limit_max=0x200000000 >> scale up with RAM
set zfs:arc_reduce_dnlc_percent=0x2 >> scale with RAM; 2 for 192 GB
set zfs:zfs_no_write_throttle=0x1 >> on/off
set zfs:zfs_nocacheflush=0x1 >> do not issue cache flushes
set zfs:zfs_vdev_cache_bshift=0xd >> 2^13 = 8 KB
set zfs:zfs_vdev_cache_size=0x1000000 >> 16 MB
set zfs:zfs_vdev_aggregation_limit=0x20000 >> 128 KB
set zfs:zfs_unmap_ignore_size=0 >> disable unmap
https://docs.oracle.com/cd/E36784_01/html/E36834/gizbc.html
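After the reboot you can confirm a tunable took effect with mdb; a sketch printing one value as a decimal (the same pattern works for the other integer tunables):

# echo zfs_nocacheflush/D | mdb -k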
After the OS reboots, check whether the FC card is in target mode:
# prtconf -D | grep qlt
# stmfadm list-target -v
# stmfadm list-state
Some commands to check the system before configuring the server as storage (SAN):
# echo | format
# diskinfo -o Dn
# fmadm faulty
# dmesg
# tail -500f /var/adm/messages
…
https://www.itfromallangles.com/2011/08/adventures-in-zfs-configuring-fibre-channel-targe
ZFS on OmniOS CE
Config OmniOS Community Edition r151026
https://docs.oracle.com/cd/E36784_01/html/E36834/gizbc.html
Tuning parameters for the system and ZFS are declared in:
# vim /etc/system
set zfs:zfs_default_bs=0xc >> ashift = 12 (4Kn, 2^12)
set zfs:zfs_arc_min=738197504 >> 1/16 of arc max
set zfs:zfs_arc_max=11811160064 >> 11 GB of 12 GB total memory
set zfs:zfs_immediate_write_sz=402653184
set zfs:zfs_vdev_async_write_max_active=32
set zfs:zfs_vdev_async_write_min_active=10
set zfs:zfs_vdev_async_read_max_active=16
set zfs:zfs_vdev_async_read_min_active=16
Basic hardware requirements for installing RHEL 7 or Ubuntu 16.04.3 (see the HCL):
Intel Xeon x86_64 CPU, from E5-2600 v1, v2, v3, v4 to Scalable…
At least 16 GB of RAM
SAS HBA matched to the SAS expander and SAS HDDs: SAS2, SAS3
Dual/quad-port FC HBA if you want to attach storage over FC (Emulex/QLogic)
Intel dual/quad-port 10 Gbps NIC for iSCSI connectivity
Backplane or disk enclosure supporting hot-swap / hot-plug SAS2, SAS3 drives
SAS HDDs whose firmware supports S.M.A.R.T. (Self-Monitoring, Analysis and
Reporting Technology; often written as SMART)
The Linux OS already supports SAS2/SAS3 SSD TRIM and NVMe PCIe SSDs
Fedora 28 Storage ZFS
# vim /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=738197504 # 1 / 16 of arc max
options zfs zfs_arc_max=11811160064 # 11G / total 12G mem
options zfs zfs_immediate_write_sz=402653184
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_async_write_min_active=10
options zfs zfs_vdev_async_read_max_active=16
options zfs zfs_vdev_async_read_min_active=16
# dracut -f -v
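Once the module is reloaded, a quick way to confirm the values took effect, using the standard sysfs parameter paths of the loaded zfs module:

# cat /sys/module/zfs/parameters/zfs_arc_max
# grep . /sys/module/zfs/parameters/zfs_vdev_async_*_active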
# modinfo zfs
# cp /root/fedora-updates.repo /etc/yum.repos.d/
# yum remove targetcli
# yum install targetcli
# systemctl enable target.service
# systemctl start target.service
# systemctl status target.service
# targetcli
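A minimal sketch of exporting a ZFS zvol over iSCSI with targetcli; the pool, volume, and target names here are hypothetical, and the IQN shown as <...> is auto-generated:

# zfs create -V 100G tank/vol1
# targetcli
/> /backstores/block create name=vol1 dev=/dev/zvol/tank/vol1
/> /iscsi create
/> /iscsi/iqn.<...>/tpg1/luns create /backstores/block/vol1
/> saveconfig

You would still add an ACL for each initiator's IQN (or relax authentication) under the target's tpg1/acls node before a host can log in.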
Basic hardware requirements for installing FreeBSD 11 (see the HCL):
Intel Xeon x86_64 CPU, from E5-2600 v1, v2, v3, v4 to Scalable…
At least 16 GB of RAM
SAS HBA matched to the SAS expander and SAS HDDs: SAS2, SAS3
Dual/quad-port FC HBA if you want to attach storage over FC (Emulex/QLogic)
Intel dual/quad-port 10 Gbps NIC for iSCSI connectivity
Backplane or disk enclosure supporting hot-swap / hot-plug SAS2, SAS3 drives
SAS HDDs whose firmware supports S.M.A.R.T. (Self-Monitoring, Analysis and
Reporting Technology; often written as SMART)
The OS already supports SAS2/SAS3 SSD TRIM and NVMe PCIe SSDs
TrueNAS
Advantages
Excellent performance, easy installation
Web-based administration
Supports replication between two storage systems running the same version
Supports FC target in DAS form
Supports iSCSI
Good VAAI support; works well with VMware
Disadvantages
Occasional connection errors with ESXi hosts
http://www.openstoragenas.com/TrueNAS-Z30-HA.asp
Table of Contents