Dear Rodovidians, please, help us cover the costs of Rodovid.org web hosting until the end of 2025.
ZFS, MySQL & InnoDB
From Rodovid Engine
[edit] innodb_flush_method = O_DIRECT (ok)
InnoDB начинает работать с файлами данных и логами напрямую, не используя буферы файловой системы. Позволяет избежать двойной буферизации при работе с файлами данных и логов.
[edit] innodb_doublewrite = false (ok)
? skip-innodb_doublewrite Механизм буфера двойной записи (doublewrite) используется в InnoDB для того, чтобы удостовериться, что измененные страницы были записаны в файл данных. Этот механизм позволяет избежать потери данных при внезапном отключении сервера. Поскольку у ZFS есть свой механизм обеспечения целостности данных, буфер двойной записи можно отключить: Innodb uses a double write buffer for safely updating pages in a tablespace. Innodb first writes the changes to the double write buffer before updating the data page. This is to prevent partial writes. Since ZFS does not allow partial writes, you can safely turn off the double write buffer. In our tests with Sysbench read-write, we say a 5% improvement in performance.
[edit] zfs set recordsize=16k tank/db (ok)
Match ZFS recordsize with Innodb page size (16KB for Datafiles, and 128KB for Innodb log files).
Why:
The biggest boost in performance can be obtained by matching the ZFS record size with the size of the IO. Since a Innodb Page is 16KB in size, most read IO is of size 16KB (except for some prefetch IO which can get coalesced). The default recordsize for ZFS is 128KB. The mismatch between the read size and the ZFS recordsize can result in severely inflated IO. If you issue a 16KB read and the data is not already there in the ARC, you have to read 128KB of data to get it. ZFS cannot do a small read because the checksum is calculated for the whole block and you have to read it all to verify data integrity. The other reason to match the IO size and the ZFS recordsize is the read-modify-write penalty. With a ZFS recordsize of 128KB, When Innodb modifies a page, if the zfs record is not already in memory, it needs to be read in from the disk and modified before writing to disk. This increases the IO latency significantly. Luckily matching the ZFS recordsize with the IO size removes all the problems mentioned above.
- For Innodb log file, the writes are usually sequential and varying in size. By using ZFS recordsize of 128KB you amortize the cost of read-modify-write.
- You need to set the recordsize before creating the database files. If you have already created the files, you need to copy the files to get the new recordsize. You can use the stat(2) command to check the recordsize (look for IO Block:) zfs set recordsize=16K z0/db
[edit] zfs set primarycache=metadata z0/db (ok)
ZFS кэширует данные в ARC, используя свободную оперативную память. Поскольку страницы InnoDB уже кэшируются в буфер пуле, отключим кэширование файлов данных InnoDB:
[edit] zfs_prefetch_disable = 1 (ok)
Why Most filesystems implement some kind of prefetch. ZFS prefetch detects linear (increasing and decreasing), strided, multiblock strided IO streams and issues prefetch IO when it will help performance. These prefetch IO have a lower priority than regular reads and are generally very beneficial. ZFS also has a lower level prefetch (commonly called vdev prefetch) to help with spatial locality of data.
In Innodb, rows are stored in order of primary index. Innodb issues two kinds of prefetch requests; one is triggered while accessing sequential pages and other is triggered via random access in an extent. While issuing prefetch IO, Innodb assumes that file is laid out in the order of the primary key. This is not true for ZFS. We are yet to investigate the impact of Innodb prefetch.
It is well known that OLTP workloads access data in a random order and hence do not benefit from prefetch. Thus we recommend that you turn off ZFS prefetch.
- If you have changed the primary cache caching strategy to just cache metadata, you will not trigger file level prefetch.
- If you have set recordsize to 16k, you will not trigger the lower level prefetch.
[edit] innodb_buffer_pool_size = NNNNNNNG (ok)
Prefer to cache within MySQL/Innodb over the ARC Why You have multiple levels of caching when you are using MySQL/Innodb with ZFS. Innodb has its own buffer pool and ZFS has the ARC. Both of them make independent decisions on what to cache and what to flush. It is possible for both of them to cache the same data. By caching inside Innodb, you get a much shorter (and faster) code path to the data. Moreover, when the Innodb buffer cache is full, a miss in the Innodb buffer cache can lead to flushing of a dirty buffer, even if the data was cached in the ARC. This leads to unnecessary writes. Even though the ARC dynamically shrinks and expands relative to memory pressure, it is more efficient to just limit it.In our tests, we have found that it is better (7-200%) to cache inside Innodb rather than ZFS.
- The ARC can be tuned to cache everything, just metadata or nothing on a per filesystem basis. See below for tuning advise about this.
[edit] If you have a write heavy workload, use a seperate intent log (slog)
How zpool add log c4t0d0 c4t1d0 Why Write latency is extremely critical for many MySQL workloads. Typically, a query will read some data, do some calculations, update some data and then commit the transaction. To commit, the Innodb log has to be updated. Many transactions can be committing at the same time. It is very important that this "wait" for commit be fast. Luckily in ZFS, synchronous writes can be accelerated up by using the Seperate Intent Log. In our tests with Sysbench read-write, we have seen around 10-20% improvement with the slog.
- If your query execution involves a physical read from disk, the time for the write may not be that important. Be sure to check this suggestion with your real workload.
- Until Bug 6574286 is fixed, you cannot remove a slog.
- Innodb actually issues multiple kinds of writes (log write, dataspace write, insert buffer write). Of these, the most critical one is the Innodb log write. The slog feature is pool wide and thus some writes (like dataspace writes), which need not go to the slog still do. This will be fixed via Bug 6832481 ZFS separate intent log bypass property
- It is also possible that during ZFS transaction sync time, the ZFS IO queue (35 deep) can get full. This means that a write has to wait for a slot to become empty. Bug 6471212: need reserved I/O scheduler slots to improve I/O latency of critical ops solves this using reserved slots. Bug 6721168 slog latency impacted by I/O scheduler during spa_sync is also worth checking out.
[edit] L2ARC (or Level 2 ARC)
How zpool add cache c4t0d0 Why If your database does not fit in memory, every time you miss the database cache, you have to read a block from disk. This cost is quite high with regular disks. You can minimize the database cache miss latency by using a (or multiple) SSDs as a level-2 cache or L2ARC. Depending on your database working set size, memory and L2ARC size you may see several orders of magnitude improvement in performance. Note
[edit] When it is safe, turn off ZFS cache flush
- How
The ZFS Evil tuning guide has more information about setting this tunable. Refer to it for the best way to achieve this.
- Why
ZFS is designed to work reliably with disks with caches. Everytime it needs data to be stored persistantly on disk, it issues a cache flush command to the disk. Disks with a battery backed caches need not do anything (i.e the cache flush command is a nop). Many storage devices interpret this correctly and do the right thing when they receive a cache flush command. However, there are still a few storage systems which do not interpret the cache flush command correctly. For such storage systems, preventing ZFS from sending the cache flush command results in a big reduction in IO latency. In our tests with Sysbench read-write test we saw a 30% improvement in performance.
- Setting this tunable on a system without a battery backed cache can cause inconsistencies in case of a crash.
- When comparing ZFS with filesystems that blindly enable the write cache, be sure to set this to get a fair comparison.
[edit] mbuffer
mbuffer -s 128k -m 1G -I 9090 | zfs receive data/filesystem zfs send data/filesystem | mbuffer -s 128k -m 1G -O dale:9090

