Coding for SSDs (translation)

SSD blog — Coding for SSDs (2014)

This is an introductory blog series on SSDs. The original blog offered Chinese and Korean translations, but the Chinese one is no longer available, so I translated it while studying the material in order to understand it better. My knowledge is limited, so please point out anything inaccurate.

Notes:
1. The quote box at the end of each section is a summary of that section.
2. Part 1 is omitted; if needed, it can be found at the link below.
3. Out of respect for the author, the last part is kept in the original English.

My impressions after reading:
1. This series is excellent introductory material on SSDs: the language is accessible and, like a survey, it gives a macro-level understanding of the field after one read.
2. It covers most of the SSD basics, but not in great depth; more advanced material is still needed afterwards.
3. It is fairly old. The basics never go out of date, but it is worth keeping an eye on recent research to avoid falling behind.
4. There are still points in here I have not fully understood and need to revisit.

Source: http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/

Part 2: Architecture of an SSD and Benchmarking

1,Structure of an SSD

<1.1> NAND-flash memory cells

An SSD is a flash-memory based data storage device. Bits are stored in cells made of floating-gate transistors. SSDs are made entirely of electronic components; there are no moving or mechanical parts like in hard disk drives.

Voltage is applied to the transistors, and this is how bits are read, written, and erased.

Two solutions exist for wiring transistors: NOR flash memory and NAND flash memory. Only NAND flash memory is covered here. For more on the differences between the two, see [http://arstechnica.com/information-technology/2012/06/inside-the-ssd-revolution-how-solid-state-disks-really-work/]

Limited lifespan: Each cell has a maximum number of P/E cycles (Program/Erase), after which the cell is considered defective. NAND-flash memory wears off and has a limited lifespan. The different types of NAND-flash memory have different lifespans

Recent research (as of 2014) showed that applying very high temperatures to NAND chips can clear the trapped electrons, which could greatly increase the lifespan of SSDs, but this is still being researched.

The current cell types are:

  • Single level cell (SLC): the transistor stores only 1 bit, but has a long lifespan
  • Multiple level cell (MLC): the transistor stores 2 bits, at the cost of higher latency and a shorter lifespan compared to SLC
  • Triple level cell (TLC): the transistor stores 3 bits, with even higher latency and an even shorter lifespan

Memory cell types: A solid-state drive (SSD) is a flash-memory based data storage device. Bits are stored into cells, which exist in three types: 1 bit per cell (single level cell, SLC), 2 bits per cell (multiple level cell, MLC), and 3 bits per cell (triple-level cell, TLC).

The table below gives detailed information for each NAND-flash cell type; for comparison, typical latencies of hard disk drives (HDD), main memory (RAM), and L1/L2 caches are included.

[Figure: characteristics and latencies of SLC, MLC, and TLC NAND-flash compared to HDD, RAM, and L1/L2 caches]

Storing more bits in the same number of transistors lowers manufacturing costs. SLC is the most reliable and has a longer lifespan than MLC and TLC, but it is more expensive to manufacture. Therefore most SSDs are based on MLC or TLC, and only professional SSDs are based on SLC. The choice depends on the workload: SLC is best suited for high-update workloads, while TLC is more suitable for high-read, low-write workloads.

Moreover, benchmarks of TLC drives under realistic workloads have shown that the lifespan of TLC-based SSDs is not a concern in practice. [http://us.hardware.info/reviews/4178/10/hardwareinfo-tests-lifespan-of-samsung-ssd-840-250gb-tlc-ssd-updated-with-final-conclusion-final-update-20-6-2013]

NAND-flash pages and blocks: Cells are grouped into blocks, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually; only whole blocks can be erased. The size of a NAND-flash page can vary; most pages are 2KB, 4KB, 8KB or 16KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256KB and 4MB.

For example, the Samsung SSD 840 EVO has blocks of size 2048KB, with each block containing 256 pages of 8KB each.

[figure]

<1.2> Organization of an SSD

The basic schematic below shows an SSD drive and its main components:

[Figure: architecture of a solid-state drive]

  • Commands from the user come in through the host interface. At the time (2014), the two most common interfaces for SSDs were Serial ATA (SATA) and PCI Express (PCIe).
  • The processor in the SSD controller receives the commands and passes them to the flash controller.
  • SSDs also have embedded RAM memory, generally used for caching and for storing mapping information (mapping policies are covered in more detail in Section 4).
  • The NAND-flash memory is organized in gangs, over multiple channels (covered in Section 6).

The pictures below show what SSDs look like in real life.

  1. 512GB Samsung 840 Pro SSD (released August 2013)

[Photos: Samsung 840 Pro SSD, front and back of the circuit board]

On its circuit board, the main components are:

  • 1 SATA 3.0 interface
  • 1 SSD controller (Samsung MDX S4LN021X01-8030)
  • 1 RAM module (256 MB DDR2 Samsung K4P4G324EB-FGC2)
  • 8 NAND-flash modules, each offering 64GB of storage (Samsung K9PHGY8U7A-CCK0)

  2. Micron P420m Enterprise PCIe (released late 2013)
    [Photos: Micron P420m, front and back of the circuit board]

    Its main components are:

    • 8 lanes of a PCIe 2.0 interface
    • 1 SSD controller
    • 1 RAM module (DRAM DDR3)
    • 64 MLC NAND-flash modules over 32 channels, each module offering 32GB of storage (Micron 31C12NQ314 25nm)

    The total storage space is 2048GB, but only 1.4TB is usable after over-provisioning.

<1.3> Manufacturing process

Many SSD manufacturers use surface-mount technology (SMT) to produce SSDs, a production method in which electronic components are placed directly on top of a printed circuit board (PCB).

2,Benchmarking and performance metrics

<2.1> Basic benchmarks

The table below shows the throughput of various solid-state drives (2008-2013) under sequential and random workloads; an HDD and a RAM memory chip are included for comparison.

[Figure: throughput of SSDs (2008-2013), an HDD, and a RAM chip under sequential and random workloads]

An important factor for performance is the host interface. The most common interfaces at the time (2014) were SATA 3.0 and PCI Express 3.0.

On a SATA 3.0 interface, data can be transferred at up to 6 Gbit/s, which in practice gives around 550 MB/s.

On a PCIe 3.0 interface, data can be transferred at up to 8 GT/s per lane, which in practice is roughly 1 GB/s. A PCIe 3.0 interface is wider than a single lane: with 4 lanes, PCIe 3.0 can offer a maximum bandwidth of 4 GB/s, roughly eight times that of SATA 3.0.

The host interface is the bottleneck for SSDs. Using PCIe 3.0 or SAS (Serial Attached SCSI) significantly improves performance.

PCI Express and SAS are faster than SATA: The two main host interfaces offered by manufacturers are SATA 3.0 (550 MB/s) and PCI Express 3.0 (1 GB/s per lane, using multiple lanes). Serial Attached SCSI (SAS) is also available for enterprise SSDs. In their latest versions, PCI Express and SAS are faster than SATA, but they are also more expensive.

<2.2> Pre-conditioning

Manufacturers always try to show the best-looking numbers, but in practice SSD performance always degrades under sustained random writes.

The figure below shows the effect of pre-conditioning on several SSDs. Performance drops noticeably after about 30 minutes: throughput decreases and latency increases for all drives. Performance then keeps degrading for another four hours, until it progressively reaches a constant minimum.

[Figure: effect of pre-conditioning on several SSD models]

What happens in this figure is that the amount of random writes is so large and applied in such a sustained way that the garbage collection process cannot keep up in the background. Garbage collection must erase blocks as the write commands arrive, and therefore competes with the foreground operations from the host.

People who use pre-conditioning claim that the benchmarks it produces accurately represent the behavior of a drive in its worst possible state. Whether this is a good model for how a drive behaves under all workloads is debatable.

In order to compare various models from different manufacturers, a common ground has to be found. But a given SSD serves only one system, which has its own particular workload. Therefore a better and more accurate way to compare different drives is to run the same replay of that workload on those drives and compare their respective performance. This is why, even though pre-conditioning with a sustained workload of random writes allows for a fair comparison of different SSDs, it should be taken with caution, and whenever possible, in-house benchmarks should be run against the target workload.

Benchmarking is hard: Testers are humans, and not all benchmarks are free of errors. Be careful when reading benchmarks from manufacturers or third parties, and use multiple sources before trusting any numbers. Whenever possible, run your own in-house benchmarks using the specific workload of your system, along with the specific SSD model you intend to use.

<2.3> Workloads and metrics

The parameters generally used are the following:

  • Type of workload: it can be a specific benchmark based on data collected from users, or just sequential or random accesses of the same type (e.g. only random writes)
  • Percentages of reads and writes performed concurrently (e.g. 30% reads, 70% writes)
  • Length of the queue: the number of concurrent execution threads issuing commands to the drive
  • Size of the data chunks being accessed (e.g. 4KB, 8KB, etc.)

Benchmark results are expressed with different metrics, the most common being:

  • Throughput: the transfer speed, generally in KB/s or MB/s. This is the metric chosen for sequential benchmarks.
  • IOPS: the number of input/output operations per second, each operation being of the same data chunk size (generally 4KB). This is the metric chosen for random benchmarks.
  • Latency: the response time of the device after a command is issued, generally in microseconds or milliseconds.

Throughput is easy to understand, IOPS a bit less. Take a drive that achieves 1000 IOPS for 4KB chunks: its throughput is 1000 * 4096 bytes, i.e. roughly 4 MB/s. Consequently, a high IOPS figure translates into a high throughput only if the chunk size is as large as possible.
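
As a quick sanity check on the arithmetic above, here is a tiny Python sketch of the IOPS-to-throughput conversion. The 4 KB chunk size is just the usual example value, not a property of any particular drive:

```python
# Throughput implied by an IOPS figure: the same arithmetic as in the paragraph above.
def iops_to_throughput_mb_s(iops: int, chunk_size_bytes: int) -> float:
    """Return the throughput in MB/s implied by `iops` operations of `chunk_size_bytes` each."""
    return iops * chunk_size_bytes / 1_000_000

# 1000 IOPS at 4 KB per operation is roughly 4 MB/s:
print(iops_to_throughput_mb_s(1000, 4096))   # 4.096
```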

Also worth knowing: a high throughput does not necessarily mean a fast system. Indeed, if the latency is high, the overall system will still be slow no matter how good the throughput is.

The takeaway from this section is that it is important to keep an eye on all the metrics, as they show different aspects of the system and allow bottlenecks to be identified when they come up.

An interesting article on this topic is "IOPS are a scam" [http://www.brentozar.com/archive/2013/09/iops-are-a-scam/]

Part 3: Pages, Blocks, and the Flash Translation Layer

This part will:

  • explain how write operations are handled at the page and block level, and discuss the fundamental concepts of write amplification and wear leveling
  • describe what the Flash Translation Layer (FTL) is and cover its two main purposes, logical block mapping and garbage collection; more specifically, it explains how write operations work in the context of a hybrid log-block mapping

3,Basic operations

<3.1> Read, write, erase

Due to the organization of NAND-flash cells, it is not possible to read or write single cells individually. Memory is grouped and accessed with very specific properties. Understanding those properties is crucial for optimizing data structures for SSDs and for understanding their behavior.

Reads are aligned on page size

It is not possible to read less than one page at once. One can of course request just one byte from the operating system, but a full page will be retrieved in the SSD, forcing more data to be read than necessary.

Writes are aligned on page size

When writing to an SSD, writes happen by increments of the page size. So even if a write operation affects only one byte, a whole page will be written anyway. Writing more data than necessary is known as write amplification, which is covered in Section 3.3.

In addition, writing data to a page is sometimes called "programming" the page, and the terms write and program are used interchangeably.

Pages cannot be overwritten

A NAND-flash page can be written to only if it is in the "free" state.

When the data in page A changes, the content of A is copied into an internal register, and the new version of the data is stored in a "free" page B. This operation is called "read-modify-write": the data is not updated in place.

Once the data is persisted to the drive, the original page is marked as "stale", and will remain so until it is erased.

Erases are aligned on block size

Pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages; only whole blocks can be erased at once.

From the user perspective, only read and write commands can be issued when accessing data. The erase command is triggered automatically by the garbage collection process inside the SSD controller.

<3.2> Example of a write

The figure below shows an example of a write operation; the rectangles are a simplified representation of a NAND-flash package.

At each step in the figure, the bullet points on the right of the schematic explain what is happening.

[Figure: step-by-step example of a write operation on a NAND-flash package]

<3.3> Write amplification

Because writes are aligned on the page size, any write operation that is neither aligned on the page size nor a multiple of the page size will require more data to be written than necessary, a concept called write amplification. (Writing one byte ends up writing a whole page, which can be as large as 16KB for some SSD models and is extremely inefficient.)

Moreover, writing data in an unaligned way causes the pages to be read into cache before being modified and written back to the drive, which is slower than writing pages directly to the drive. This operation is known as read-modify-write and should be avoided whenever possible.

Never write less than a page

Avoid writing chunks of data that are below the size of a NAND-flash page, to minimize write amplification and read-modify-write operations. The largest page size at the moment (2014) is 16KB, so it is the value to use by default. This size depends on the SSD model, and you may need to increase it in the future as SSDs improve.

Align writes

Align writes on the page size, and write chunks of data that are a multiple of the page size.

Buffer small writes

To maximize throughput, whenever possible keep small writes in a buffer in RAM, and when the buffer is full, perform a single large write to batch all the small writes.
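
To make the three recommendations above concrete, here is a small, illustrative Python sketch of a write buffer that only ever issues page-sized, page-aligned writes. The 16 KB page size, the zero-padding of the final partial page, and the output path are assumptions for the example, not properties of any particular SSD:

```python
PAGE_SIZE = 16 * 1024  # assumed NAND-flash page size; check the actual value for your drive

class PageAlignedWriter:
    """Accumulate small writes in RAM and flush them in page-sized chunks."""
    def __init__(self, f):
        self.f = f
        self.buf = bytearray()

    def write(self, data: bytes) -> None:
        self.buf.extend(data)
        # Flush only full pages; keep the remainder buffered.
        while len(self.buf) >= PAGE_SIZE:
            self.f.write(self.buf[:PAGE_SIZE])
            del self.buf[:PAGE_SIZE]

    def close(self) -> None:
        if self.buf:  # pad the final partial page so the last write stays page-sized
            self.buf.extend(b"\0" * (PAGE_SIZE - len(self.buf)))
            self.f.write(self.buf)
        self.f.flush()

# Usage: batch ten thousand 100-byte records into 16 KB writes.
with open("/tmp/out.bin", "wb") as f:
    w = PageAlignedWriter(f)
    for _ in range(10_000):
        w.write(b"x" * 100)
    w.close()
```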

<3.4> Wear leveling

As said before, NAND-flash cells have a limited lifespan due to their limited number of Program/Erase cycles.

Let's imagine we have a hypothetical SSD in which data is always read from and written to the same block. That block would quickly exceed its P/E cycle limit and wear off, and the SSD controller would mark it as unusable. The overall capacity of the disk would then shrink. Imagine buying a 500GB drive and being left with 250GB two years later; that would be unacceptable!

For this reason, one of the main goals of an SSD controller is to implement wear leveling, which distributes P/E cycles as evenly as possible among the blocks. Ideally, all blocks would reach their P/E cycle limit and wear off at the same time.

In order to achieve the best overall wear leveling, the SSD controller needs to choose blocks judiciously when writing, and may have to move data between certain blocks, a process which itself increases write amplification. Block management is therefore a trade-off between maximizing wear leveling and minimizing write amplification.

Manufacturers have come up with many mechanisms to achieve wear leveling, such as garbage collection.

Wear leveling: Because NAND-flash cells are wearing off, one of the main goals of the FTL (Flash Translation Layer) is to distribute the work among cells as evenly as possible so that blocks will reach their P/E cycle limit and wear off at the same time.

4,Flash Translation Layer (FTL)

<4.1> On the necessity of having an FTL

The main factor that made SSDs so easy to adopt is that they use the same host interface as HDDs. Although presenting an array of Logical Block Addresses (LBA) makes sense for HDDs, since their sectors can be overwritten, it is not fully suited to the way flash memory works.

For this reason, an additional component is required to hide the inner characteristics of NAND flash memory and expose only an array of LBAs to the host. This component is called the Flash Translation Layer (FTL), and it resides in the SSD controller.

The FTL is critical and has two main purposes: logical block mapping and garbage collection.

<4.2> Logical block mapping

The logical block mapping translates logical block addresses (LBAs) from the user space into physical block addresses (PBAs) in the physical NAND-flash space. The mapping takes the form of a table which, for any LBA, gives the corresponding PBA.

This mapping table is stored in the RAM of the SSD for speed of access, and is persisted in flash memory in case of power failure. When the SSD powers up, the table is read from the persisted version and reconstructed into the RAM of the SSD.

The naive approach is to use a page-level mapping from every logical page of the host to a physical page, but this requires a lot of RAM, which increases manufacturing costs. A solution is to use **block-level mapping**, a huge improvement in terms of space. However, the mapping still needs to be persisted to disk on power failure, and with workloads containing a lot of small updates, full blocks of flash memory will be written whereas pages would have been enough. This increases write amplification and makes block-level mapping inefficient.
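
To see why page-level mapping needs so much more RAM than block-level mapping, here is a back-of-the-envelope calculation in Python. All the numbers (drive size, page size, block size, 4 bytes per entry) are illustrative assumptions, not the parameters of any real SSD:

```python
# Rough RAM cost of a page-level vs. block-level mapping table for a hypothetical drive.
drive_size  = 256 * 2**30      # 256 GiB of logical space
page_size   = 8   * 2**10      # 8 KiB pages
block_size  = 2   * 2**20      # 2 MiB blocks (256 pages per block)
entry_bytes = 4                # assumed size of one mapping entry

page_level_entries  = drive_size // page_size
block_level_entries = drive_size // block_size

print(page_level_entries  * entry_bytes / 2**20, "MiB")   # 128.0 MiB of RAM
print(block_level_entries * entry_bytes / 2**20, "MiB")   # 0.5 MiB of RAM
```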

The trade-off between page-level mapping and block-level mapping is a trade-off between performance and space. Some researchers have tried to get the best of both worlds, giving birth to so-called "hybrid" approaches. The most common is the log-block mapping, which uses an approach similar to log-structured file systems: incoming writes are written sequentially to log blocks. When a log block is full, it is merged with the data block associated with the same logical block number (LBN) into a free block. Only a few log blocks need to be maintained, which allows them to be maintained at page granularity; data blocks, on the contrary, are maintained at block granularity. [A Reconfigurable FTL (Flash Translation Layer) Architecture for NAND Flash-Based Applications, Park et al., 2008]

[Figure: hybrid log-block FTL, simplified representation]

The figure above shows a simplified representation of a hybrid log-block FTL in which each block has only four pages. The FTL handles four write operations, each having the size of a full page. The logical page numbers 5 and 9 both resolve to LBN=1, which is associated with the physical block #1000. Initially, all the physical page offsets of the LBN=1 entry in the log-block mapping table are null, and log block #1000 is empty. The first write, b' at LPN=5, resolves through the log-block mapping table to LBN=1, which is associated with PBN=1000 (log block #1000). The page b' is therefore written at physical offset 0 of block #1000. The mapping metadata now needs to be updated: the physical offset associated with the logical page offset 1 (an arbitrary value in this example) is updated from null to 0.

(My own take: the logical page offsets correspond to the pages' positions within the logical block, and the operations on those pages are recorded, in order, into the physical page offsets of the log block, like an append-only log.)

The write operations continue and the mapping metadata is updated accordingly. When log block #1000 is completely filled, it is merged with the data block associated with the same logical block, in this case block #3000. This information can be retrieved from the data-block mapping table, which maps logical block numbers to physical block numbers. The data resulting from the merge operation is written to a free block, here #9000. When this is done, blocks #1000 and #3000 can be erased and become free blocks, and block #9000 becomes a data block. The metadata for LBN=1 in the data-block mapping table is then updated from the initial data block #3000 to the new data block #9000.

One important thing to notice here is that the four write operations were concentrated on only two LPNs. The log-block approach makes it possible to hide the b' and d' operations during the merge and to keep only the latest b'' and d'' versions, achieving better write amplification.

Finally, if a read command requests a page that was recently updated and whose block has not yet been merged, the page will be found in a log block. Otherwise, it will be found in a data block. This is why read requests need to check both the log-block mapping table and the data-block mapping table.
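
The following Python sketch mimics the mechanism described above at toy scale: a log-block mapping table maintained at page granularity, a data-block mapping table at block granularity, a merge when a log block fills up, and a read path that checks the log block first. It is a simplification under stated assumptions (four pages per block, a trivial LPN-to-LBN mapping by plain division rather than the arbitrary association used in the figure, and no wear leveling or garbage collection):

```python
PAGES_PER_BLOCK = 4

class LogBlockFTL:
    def __init__(self, num_blocks=16):
        self.flash = {pbn: [None] * PAGES_PER_BLOCK for pbn in range(num_blocks)}
        self.free_blocks = list(range(num_blocks))
        self.data_block = {}   # LBN -> PBN of the data block
        self.log_block = {}    # LBN -> (PBN of log block, {logical page offset -> physical page offset})

    def _alloc(self):
        return self.free_blocks.pop()

    def write(self, lpn, data):
        lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
        if lbn not in self.log_block:
            self.log_block[lbn] = (self._alloc(), {})
        log_pbn, log_map = self.log_block[lbn]
        phys_offset = len(log_map)             # log pages are filled sequentially
        self.flash[log_pbn][phys_offset] = data
        log_map[offset] = phys_offset          # the latest version wins
        if phys_offset == PAGES_PER_BLOCK - 1:
            self._merge(lbn)

    def _merge(self, lbn):
        log_pbn, log_map = self.log_block.pop(lbn)
        data_pbn = self.data_block.get(lbn)
        new_pbn = self._alloc()
        for off in range(PAGES_PER_BLOCK):     # take the freshest copy of every page
            if off in log_map:
                self.flash[new_pbn][off] = self.flash[log_pbn][log_map[off]]
            elif data_pbn is not None:
                self.flash[new_pbn][off] = self.flash[data_pbn][off]
        self.data_block[lbn] = new_pbn
        for pbn in (log_pbn, data_pbn):        # erase old blocks, making them free again
            if pbn is not None:
                self.flash[pbn] = [None] * PAGES_PER_BLOCK
                self.free_blocks.append(pbn)

    def read(self, lpn):
        lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
        if lbn in self.log_block and offset in self.log_block[lbn][1]:
            log_pbn, log_map = self.log_block[lbn]
            return self.flash[log_pbn][log_map[offset]]
        return self.flash[self.data_block[lbn]][offset]

ftl = LogBlockFTL()
for lpn, data in [(5, "b'"), (7, "d'"), (5, "b''"), (7, "d''")]:
    ftl.write(lpn, data)
print(ftl.read(5), ftl.read(7))   # b'' d''  -- only the latest versions survive the merge
```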

The log-block FTL can be optimized, most notably with the switch-merge, sometimes called swap-merge. Imagine that all the addresses in a logical block were written at once: all the data for those addresses would then be written to the same log block. Since that log block contains the data for the whole logical block, merging it with the data block into a free block would be useless, as the resulting free block would contain exactly the same data as the log block. It is faster to only update the metadata in the data-block mapping table and switch the data block for the log block in that table; this is a switch-merge.

Consequently, writing chunks of data that are at least the size of a NAND-flash block is more efficient, because for the FTL it minimizes the overhead of updating the mapping and its metadata.

The log-block mapping scheme has been the topic of many papers, which has led to a series of improvements such as FAST (Fully Associative Sector Translation), superblock mapping, and flexible group mapping [A Reconfigurable FTL (Flash Translation Layer) Architecture for NAND Flash-Based Applications, Park et al., 2008]. Other mapping schemes also exist, such as the Mitsubishi algorithm and SSR. The papers cited in the original article are good starting points for learning more about the FTL and mapping schemes.

Flash Translation Layer: The Flash Translation Layer(FTL) is a component of the SSD controller which maps Logical Block Address (LBA) from the host to Physical Block Address (PBA) on the drive. Most recent drives implement an approach called “hybrid log-block mapping” or one of its derivatives, which works in a way that is similar to log-structured file systems. This allows random writes to be handled like sequential writes.

<4.3> Garbage collection

As explained in Section 3.1, pages cannot be overwritten. If the data in a page has to be updated, the new version is written to a free page, and the page containing the previous version is marked as stale. When blocks contain stale pages, they need to be erased before they can be written to again.

Garbage collection: The garbage collection process in the SSD controller ensures that "stale" pages are erased and restored into a "free" state so that the incoming write commands can be processed.

Because the erase command has a high latency compared to the write command, this extra erase step causes a delay which makes writes slower. Therefore, some controllers implement a background garbage collection process, also called idle garbage collection, which takes advantage of idle time and runs regularly in the background to reclaim stale pages and ensure that future foreground operations will have enough free pages available to achieve the highest performance. Other implementations use a parallel garbage collection approach, which performs garbage collection operations in parallel with write operations from the host.
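
As an illustration of what one garbage-collection pass has to do, here is a minimal Python sketch using a greedy victim-selection policy (pick the block with the most stale pages, copy its still-valid pages elsewhere, erase it). The policy and the block layout are assumptions for the example; the article does not specify which policy real controllers use:

```python
def gc_pass(blocks, free_block):
    """`blocks` maps block id -> list of pages, each page being 'stale', 'free', or live data."""
    victim = max(blocks, key=lambda b: blocks[b].count("stale"))
    moved = [p for p in blocks[victim] if p not in ("stale", "free")]
    free_block.extend(moved)                          # relocating valid pages is write amplification
    blocks[victim] = ["free"] * len(blocks[victim])   # the whole block is erased at once
    return victim, len(moved)

blocks = {
    1000: ["stale", "stale", "stale", "A"],
    3000: ["B", "stale", "C", "D"],
}
print(gc_pass(blocks, free_block=[]))   # (1000, 1): block 1000 is erased, one valid page copied
```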

Background operations can affect foreground operations: Background operations such as garbage collection can impact negatively on foreground operations from the host, especially in the case of a sustained workload of small random writes.

A secondary reason why blocks are moved is read disturb. Reading can change the state of nearby cells, so a block needs to be moved after a certain number of read operations.

The rate at which data changes is also an important factor. Some data changes rarely and is called cold or static data, while some data is updated frequently and is called hot or dynamic data. If a page stores partly cold and partly hot data, the cold data will be copied along with the hot data during garbage collection for wear leveling, which increases write amplification. This can be avoided simply by separating cold data from hot data and storing them in different pages. The drawback is that the pages containing cold data are erased less frequently, so the blocks storing cold data and hot data have to be swapped regularly for wear leveling.

Since data hotness is defined at the application level, the FTL has no way of knowing how much of a page is hot data and how much is cold data. One way to improve SSD performance is to split hot and cold data into different pages, which makes the job of the garbage collector easier.

Split cold and hot data: Hot data is data that changes frequently, and cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data will be copied along every time the hot data is updated in a read-modify-write operation, and will be moved along during garbage collection for wear leveling. Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.

Buffer hot data: Extremely hot data should be buffered as much as possible and written to the drive as infrequently as possible.

Invalidate obsolete data in large batches: When some data is no longer needed or needs to be deleted, it is better to wait and invalidate it in a large batch in a single operation. This will allow the garbage collector to handle larger areas at once and will help minimize internal fragmentation.

Part 4: Advanced Functionalities and Internal Parallelism

This part briefly covers some of the main SSD functionalities, such as TRIM and over-provisioning. It also covers the different levels of internal parallelism in an SSD and the concept of clustered blocks.

5,Advanced functionalities

<5.1> TRIM

Imagine that a program writes files to all the logical block addresses of an SSD: that SSD would be considered full. Now imagine all those files are deleted. The filesystem would report 100% free space, although the drive would still be full, because the SSD controller has no way of knowing when logical data is deleted by the host. The SSD controller will only see that free space once the logical block addresses that used to hold the files get overwritten. At that moment, the garbage collection process will erase the blocks associated with the deleted files, providing free pages for incoming writes. As a result, instead of erasing the blocks holding the obsolete data as soon as possible, the erasing is delayed (because an erase operation is required), which hurts performance badly.

Another concern is that, since the SSD controller is not aware of which pages hold deleted files, the garbage collection keeps moving them around to enforce wear leveling. This increases write amplification and interferes with the foreground workload of the host for no good reason.

A solution to the problem of delayed erasing is the TRIM command, which lets the operating system notify the SSD controller that certain pages are no longer in use in the logical space. With that information, the garbage collection process knows that it does not need to move those pages around, and that it can erase them whenever needed. The TRIM command only works if the SSD controller, the operating system, and the filesystem all support it.

The Wikipedia entry on the TRIM command [https://en.wikipedia.org/wiki/Trim_(computing)] lists the operating systems and filesystems that support it. As for Windows 7, it supports TRIM only for SSDs using a SATA interface, not PCI Express.

<5.2> Over-provisioning

Over-provisioning is simply having more physical blocks than logical blocks, by keeping a ratio of physical blocks reserved for the controller and not visible to the user. Most manufacturers of professional SSDs already include some over-provisioning, generally around 7% to 25%. Users can create additional over-provisioning simply by partitioning a disk to a lower logical capacity than its maximum physical capacity.

Although the over-provisioned space is not visible to the operating system, the SSD controller can see it. The main reason manufacturers over-provision is to cope with the inherently limited lifespan of NAND-flash cells.

The invisible over-provisioned blocks are there to seamlessly replace the blocks wearing off in the visible space.

AnandTech has an interesting article showing the effect of over-provisioning on SSD lifespan and performance [http://www.anandtech.com/show/6489]. An article at Percona shows another interesting result: they tested an Intel 320 SSD and showed that write throughput decreases as the disk fills up [http://www.ssdperformanceblog.com/2011/06/intel-320-ssd-random-write-performance/].

Here is the author's explanation of this phenomenon (over-provisioning improving performance). Garbage collection uses idle time to erase stale pages in the background. But since erase operations have a higher latency than write operations, an SSD under a heavy workload of sustained random writes will use up all of its free blocks before garbage collection has had enough time to erase stale pages. At that point, the FTL cannot keep up with the foreground workload of random writes, and the garbage collection process has to erase blocks at the same time as write commands are coming in.

This is where over-provisioning helps: it acts as a buffer that absorbs high-throughput write workloads, leaving the garbage collection process enough time to catch up and erase blocks again. How much over-provisioning is needed? A rule of thumb is 25% [http://www.anandtech.com/show/6489].
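
A quick Python sketch of the capacity trade-off, using the 7%, 15% and 25% figures mentioned in this section and a hypothetical 512 GB drive (the drive size is just an example value):

```python
# Usable capacity remaining for a given over-provisioning ratio.
def usable_capacity_gb(physical_gb: float, over_provisioning: float) -> float:
    return physical_gb * (1 - over_provisioning)

for op in (0.07, 0.15, 0.25):
    print(f"{op:.0%} OP on a 512 GB drive -> {usable_capacity_gb(512, op):.0f} GB visible to the host")
```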

Over-provisioning is useful for wear leveling and performance:

A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible and used by the SSD controller.

Over-provisioning helps the wear leveling mechanisms to cope with the inherent limited lifespan of NAND-flash cells. For workloads in which writes are not so heavy, 10% to 15% of over-provisioning is enough. For workloads of sustained random writes, keeping up to 25% of over-provisioning will improve performance. The over-provisioning will act as a buffer of NAND-flash blocks, helping the garbage collection process to absorb peaks of writes.

<5.3> Secure Erase

Some SSD controllers offer the ATA Secure Erase functionality, whose goal is to restore the drive to its fresh, out-of-the-box performance. The command erases all data written by the user and resets the FTL mapping tables, but it obviously cannot overcome the physical limit of the finite number of P/E cycles.

The performance implications matter, and so, increasingly, do the security implications. There are discussions on Stack Exchange about how to reliably erase data from an SSD. [http://security.stackexchange.com/questions/12503/can-wiped-ssd-data-be-recovered] [http://security.stackexchange.com/questions/5662/is-it-enough-to-only-wipe-a-flash-drive-once]

<5.4> Native Command Queueing (NCQ)

Native Command Queueing (NCQ) is a feature of Serial ATA that allows an SSD to accept multiple commands from the host so that they can be completed concurrently using the internal parallelism. In addition to reducing latency due to the drive, some newer drives also use NCQ to cope with latency from the host. For example, NCQ can prioritize incoming commands to ensure that the drive always has commands to process while the host CPU is busy. [https://en.wikipedia.org/wiki/Native_Command_Queuing]

<5.5> Power-loss protection

Power outages happen, whether at home or in a data center. Some manufacturers include a supercapacitor in their SSD architecture, which holds enough power to commit the I/O requests in flight on the bus when the power goes off, and leaves the drive in a consistent state. The problem is that not all SSD manufacturers provide a supercapacitor or some other form of power-failure data protection for their drives. And, as with the Secure Erase command, it is not clear whether the power-failure mechanisms are implemented correctly and whether they actually protect drives from data corruption when the power is cut.

SSDs are still a very young technology, and the author believes that their resistance to data corruption under power failure will improve in the next generations. Nonetheless, for now it may be worth investing in an uninterruptible power supply (UPS) in a data center setup, and, as with any other storage solution, backing up sensitive data regularly.

6,Internal Parallelism in SSDs

<6.1> Limited I/O bus bandwidth

Due to physical limitations, an asynchronous NAND-flash I/O bus cannot offer more than 32-40 MB/s of bandwidth [Exploring and Exploiting the Multilevel Parallelism Inside SSDs for Improved Performance and Endurance, Hu et al., 2013]. The only way for SSD manufacturers to increase performance is to design their drives so that multiple packages can be parallelized or interleaved.

By combining all the levels of internal parallelism in an SSD, multiple blocks can be accessed simultaneously across separate chips; such a unit is called a clustered block. The intent here is not to cover all the details of internal parallelism in SSDs, but simply to outline the levels of parallelism and the notion of clustered block. Two good starting points to learn more about parallelism inside SSDs are [Parameter-Aware I/O Management for Solid State Disks (SSDs), Kim et al., 2012] and [Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing, Chen et al, 2011].

Internal parallelism: Internally, several levels of parallelism allow writing to several blocks at once in different NAND-flash chips, into what is called a "clustered block".

<6.2> Multiple levels of parallelism

The figure below shows the internal organization of a NAND-flash package, which is structured hierarchically. The levels are: channel, package, chip, plane, block, and page.

[Figure: internal organization of a NAND-flash package]

These different levels offer the following forms of parallelism:

  • Channel-level parallelism. The flash controller communicates with the flash packages through multiple channels. These channels can be accessed independently and simultaneously. Each individual channel is shared by multiple packages.
  • Package-level parallelism. The packages on a channel can be accessed independently, and interleaving can be used to run commands on packages sharing the same channel.
  • Chip-level parallelism. A package contains two or more chips, which can be accessed independently and in parallel. Note: chips are also called "dies".
  • Plane-level parallelism. A chip contains two or more planes, and the same operation (read, write, or erase) can be run simultaneously on multiple planes within a chip. Planes contain blocks, which contain pages. A plane also contains registers (small RAM buffers), which are used for plane-level operations.

<6.3> Clustered blocks

Multiple blocks accessed across multiple chips are called a clustered block; the idea is similar to the concept of striping found in RAID systems.

Logical block addresses that are accessed at once are striped over different SSD chips in different flash packages. This is made possible by the mapping algorithm of the FTL, and happens whether or not those addresses are sequential. Striping blocks allows multiple channels to be used simultaneously and their bandwidth to be combined, and also allows multiple read, write, and erase operations to be performed in parallel. This means that I/O operations that are both aligned on and a multiple of the clustered block size guarantee the best possible use of all the performance offered by the various levels of internal parallelism in the SSD.
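
Here is a small Python sketch of the striping idea, in the spirit of RAID striping: consecutive pages of a clustered block are spread round-robin over channels and chips. The 2-channel, 2-chips-per-channel geometry is an arbitrary assumption; real FTL placement policies are more involved:

```python
NUM_CHANNELS = 2        # assumed geometry for illustration
CHIPS_PER_CHANNEL = 2

def stripe(page_index: int):
    """Map the n-th page of a clustered block to a (channel, chip) pair, round-robin."""
    channel = page_index % NUM_CHANNELS
    chip = (page_index // NUM_CHANNELS) % CHIPS_PER_CHANNEL
    return channel, chip

# Four consecutive pages land on four different chips, using both channels:
print([stripe(i) for i in range(4)])   # [(0, 0), (1, 0), (0, 1), (1, 1)]
```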

Part 5: Access Patterns and System Optimizations

The previous parts covered most of the inner workings of SSDs. This part presents data that helps determine which access patterns should be used and why they are indeed better than others. It explains how writes should be done, how reads should be done, and why concurrent read and write operations interfere with each other. It also covers a few optimizations at the filesystem level that can improve performance.

7,Access patterns

<7.1> Defining sequential and random I/O operations

In the following subsections, accesses are referred to as "sequential" or "random":

  • sequential: an I/O operation whose starting logical block address (LBA) directly follows the last LBA of the previous I/O operation
  • random: any I/O operation that is not sequential

It is important to note that, due to the dynamic mapping performed by the FTL, contiguous addresses in the logical space may refer to addresses that are not contiguous in the physical space.

<7.2> Writes

Benchmarks and manufacturer data sheets show that random writes are slower than sequential writes, though this is not always true, as it depends on the exact type of random write workload. If the size of the writes is small (small meaning below the size of the clustered block, i.e. < 32MB), then yes, random writes are slower than sequential writes. However, if the random writes are both a multiple of and aligned to the clustered block size, they perform just as well as sequential writes.

The explanation is as follows: as covered in Section 6, the internal parallelism in SSDs allows clustered blocks to be written at once, using a combination of parallelism and interleaving. Therefore, whether they are sequential or random, writes are striped over multiple channels and chips internally in the same way, and performing writes of the clustered block size guarantees that all of the internal parallelism will be used.

Performance-wise, as the two figures below show, random write throughput reaches the same level as sequential write throughput as soon as the write buffer used by the benchmark is equal to or larger than the clustered block size, which is 16 or 32 MB for most SSDs.

[Figure: comparison of the effects of a sequential write workload versus a random write workload over four SSDs]

[Figure: comparison of the effects of a sequential write workload versus a random write workload over three SSDs]

However, if the writes are small (small meaning below the size of a NAND-flash page, i.e. < 16KB), then the controller has a lot of extra work to do in order to maintain the metadata needed for the block mapping. Indeed, some SSDs use tree-like data structures to represent the mapping between logical and physical block addresses, and a lot of small random writes translate into a lot of updates to the mapping in RAM. Since this mapping is persisted from RAM to flash memory, all those updates in RAM will cause a lot of writes to flash memory. A sequential workload, on the contrary, incurs fewer updates to the metadata, and therefore fewer writes to flash memory.

Random writes are not always slower than sequential writes:

If the writes are small (i.e. below the size of the clustered block), then random writes are slower than sequential writes.

If writes are both multiple of and aligned to the size of a clustered block, the random writes will use all the available levels of internal parallelism, and will perform just as well as sequential writes.

Another reason is that, if the random writes are small, they will cause a large number of copy-erase-write operations on the blocks. Sequential writes of at least the size of a block, on the other hand, allow the faster switch-merge optimization to be used. Moreover, small random writes are known to invalidate data randomly: instead of a few blocks being fully invalidated, many blocks end up having only one page invalidated, which scatters stale pages across the physical space instead of localizing them. This phenomenon is known as internal fragmentation, and it causes the cleaning efficiency to drop by requiring the garbage collection process to run a large number of erase operations to create free pages.

Finally, regarding concurrency, it has been shown that writing one large buffer with a single thread is as fast as writing many smaller buffers with many concurrent threads. Indeed, a large write guarantees that all of the internal parallelism of the SSD is used, so trying to perform multiple writes in parallel will not improve throughput [1, 5]. However, many parallel writes will cause latency to increase compared to single-threaded access [3, 26, 27].

A single large write is better than many small concurrent writes

A single large write request offers the same throughput as many small concurrent writes, however in terms of latency, a large single write has a better response time than concurrent writes. Therefore, whenever possible, it is best to perform large writes.

When the writes are small and cannot be grouped or buffered, multi-threading is beneficial

Many concurrent small write requests will offer a better throughput than a single small write request. So if the I/O is small and cannot be batched, it is better to use multiple threads.

<7.3> Reads

Reads are faster than writes. As for sequential reads versus random reads, it all depends. The FTL maps logical blocks to physical blocks dynamically and stripes writes across channels. This approach is sometimes referred to as "write-order-based" mapping. If data is read in a way that is completely random and does not match the way it was originally written, there is no guarantee that consecutive reads will be spread across different channels. It is even possible that consecutive random reads access different blocks from a single channel, thus not taking advantage of the internal parallelism. Acunu wrote a blog article showing that, at least for the drives they tested, read performance is directly correlated with how closely the read access pattern matches the pattern in which the data was originally written [47].

To improve the read performance, write related data together:

Read performance is a consequence of the write pattern. When a large chunk of data is written at once, it is spread across separate NAND-flash chips. Thus you should write related data in the same page, block, or clustered block, so it can later be read faster with a single I/O request, by taking advantage of the internal parallelism.

The figure below shows an SSD with two channels and four chips, with one plane per chip.

Notes:

  1. This is only a schematic, and technically invalid, since SSD chips always have two or more planes.
  2. Each capital letter represents a chunk of data of the size of one NAND-flash block.
  3. The operation at the top of the figure writes four blocks sequentially, [A B C D], which in this example is the size of one clustered block.
  4. The write operation is striped over the four planes using parallelism and interleaving, which makes it faster.
  5. Even though the four blocks are sequential in the logical block address space, they are stored internally on four different planes.

[Figure: an SSD with two channels and four chips, writing the clustered block [A B C D] and reading [A B E F] and [A B G H]]

With a write-order-based FTL, all blocks in a plane are equally likely to be chosen for incoming writes, therefore a clustered block will not necessarily be made of blocks that have the same PBN in their respective planes. For example, in the figure, the first clustered block is made of blocks from four different planes, whose PBNs in their respective planes are 1, 23, 11 and 51.

The bottom of the figure shows two read operations, [A B E F] and [A B G H]. For [A B E F], A and E belong to the same plane, and B and F belong to another plane, so [A B E F] can only be read from two planes over one single channel. In the case of [A B G H], A, B, G and H are stored on four different planes, therefore [A B G H] can be read from four planes over two channels at the same time. Reading from more planes and more channels takes advantage of more of the internal parallelism, and therefore grants better read performance.

A direct consequence of the internal parallelism is that trying to read data using multiple threads simultaneously will not necessarily improve performance. Indeed, if the locations accessed by the threads are unaware of the internal mapping and do not take advantage of it, the threads could end up accessing the same channel. It has also been shown that concurrent read threads can impair the readahead (prefetching buffer) capabilities of SSDs [3].

A single large read is better than many small concurrent reads:

Concurrent random reads cannot fully make use of the readahead mechanism. In addition, multiple Logical Block Addresses may end up on the same chip, not taking advantage of the internal parallelism. Moreover, a large read operation will access sequential addresses and will therefore be able to use the readahead buffer if present. Consequently, it is preferable to issue large read requests.

<7.4> Concurrent reads and writes

Interleaving small reads and writes causes performance to decrease [1, 3]. The main reason is that reads and writes compete for the same internal resources, and mixing them prevents some mechanisms, such as readahead, from being fully exploited.

Separate read and write requests:

A workload made of a mix of small interleaved reads and writes will prevent the internal caching and readahead mechanisms from working properly, and will cause the throughput to drop.

It is best to avoid simultaneous reads and writes, and perform them one after the other in large chunks, preferably of the size of the clustered block.

For example, if 1000 files have to be updated, you could iterate over the files, doing a read and a write on one file and then moving to the next file, but that would be slow. It would be better to read all 1000 files at once and then write back to those 1000 files at once.
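
A minimal Python sketch of the pattern suggested above: read all the files in one batch, then write them all back in a second batch, instead of interleaving one read and one write per file. The file paths and the transformation are placeholders:

```python
def update_files(paths, transform):
    contents = {}
    for p in paths:                       # phase 1: one large batch of reads
        with open(p, "rb") as f:
            contents[p] = f.read()
    for p, data in contents.items():      # phase 2: one large batch of writes
        with open(p, "wb") as f:
            f.write(transform(data))

# Example usage (hypothetical file names):
# update_files([f"data/{i}.bin" for i in range(1000)], transform=lambda b: b.upper())
```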

8,System optimizations

<8.1> Partition alignment

As explained in Section 3.1, writes are aligned on the page size. A write request that has the size of a page and is also aligned on the page size will be written to a NAND-flash physical page directly; a write request that has the size of a page but is not aligned will be written over two NAND-flash physical pages, incurring two read-modify-write operations [53]. Therefore, it is critical to ensure that the partition used to write to an SSD is aligned with the physical NAND-flash page size of the drive. Various guides and tutorials show how to align a partition to the parameters of an SSD when formatting [54, 55].

It has been shown that partition alignment significantly improves performance [43]. A test on one drive also showed that bypassing the filesystem and writing directly to the drive improves performance, although the improvement was small [44].

Align the partition:

To ensure that logical writes are truly aligned to the physical memory, you must align the partition to the NAND-flash page size of the drive.
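
A small Python sketch of the alignment check implied above: multiply the partition's starting sector by the sector size and verify it is a multiple of the NAND-flash page size. The 512-byte sector and 8 KB page values are assumptions; use the real values for your drive:

```python
SECTOR_SIZE = 512          # bytes per logical sector (assumed)
NAND_PAGE_SIZE = 8 * 1024  # bytes per NAND-flash page (assumed)

def partition_is_aligned(start_sector: int) -> bool:
    return (start_sector * SECTOR_SIZE) % NAND_PAGE_SIZE == 0

print(partition_is_aligned(2048))  # True: 2048 * 512 B = 1 MiB, a multiple of 8 KiB
print(partition_is_aligned(63))    # False: the legacy 63-sector offset is misaligned
```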

<8.2> Filesystem parameters

Not all filesystems support the TRIM command. On Linux 2.6.33 and above, ext4 and XFS support TRIM, but it still needs to be enabled with the discard mount parameter. From there, a few other tweaks consist of disabling metadata updates if they are not needed, by removing the relatime parameter if present and adding noatime,nodiratime [40, 55, 56, 57].

Enable the TRIM command:

Make sure your kernel and filesystem support the TRIM command. The TRIM command notifies the SSD controller when a block is deleted. The garbage collection process can then erase blocks in the background during idle times, preparing the drive to face large write workloads.
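
On Linux, one way to verify that a filesystem is mounted with the discard (TRIM) and noatime options is to parse /proc/self/mounts, as in this sketch; the mount point "/" is just an example:

```python
def mount_options(mount_point: str) -> set:
    """Return the set of mount options for `mount_point`, or an empty set if not mounted."""
    with open("/proc/self/mounts") as f:
        for line in f:
            device, mnt, fstype, options = line.split()[:4]
            if mnt == mount_point:
                return set(options.split(","))
    return set()

opts = mount_options("/")
print("discard enabled:", "discard" in opts)
print("noatime enabled:", "noatime" in opts)
```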

<8.3> Operating system I/O scheduler

The default I/O scheduler on Linux is the CFQ scheduler (Completely Fair Queuing). CFQ was designed to minimize seek time on spinning hard disk drives by grouping I/O requests that are physically close to each other. Since SSDs have no mechanical parts, such reordering of I/O requests is not needed. Various guides and discussions advocate that changing the I/O scheduler from CFQ to NOOP or Deadline will reduce latency on SSDs [56, 58]. However, since Linux 3.1, CFQ offers some optimizations for solid-state drives [59]. Benchmarks have also shown that scheduler performance depends on the workload applied to the SSD and on the drive itself.

The author's take is to stick with CFQ, unless the workload is very specific and another scheduler shows a clear advantage for it.
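
On Linux, the active I/O scheduler of a block device can be read from sysfs; the sketch below shows one way to do it (the device name "sda" is an example, and the active scheduler is the one shown in brackets):

```python
def current_scheduler(device: str = "sda") -> str:
    """Return the active I/O scheduler of `device`, e.g. 'cfq', 'noop', or 'deadline'."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        schedulers = f.read().split()
    active = next(s for s in schedulers if s.startswith("["))   # the bracketed entry is active
    return active.strip("[]")

print(current_scheduler("sda"))
```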

<8.4> Swap

Because swapping pages out to the drive generates a lot of I/O requests, a swap partition on an SSD increases the wear rate of the drive and significantly reduces its lifespan. In the Linux kernel, the vm.swappiness parameter controls how often pages are swapped out to the drive. Its value ranges from 0 to 100, where 0 means the kernel should avoid swapping as much as possible and 100 means the kernel should swap as much as possible. On Ubuntu, for example, the default swappiness is 60.

Other options are to use a RAM disk for swap, or to avoid swap altogether.
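
The current value of vm.swappiness can be read from /proc/sys/vm/swappiness on Linux, which is what the following sketch does (equivalent to running `sysctl vm.swappiness`):

```python
def swappiness() -> int:
    """Return the kernel's current vm.swappiness value."""
    with open("/proc/sys/vm/swappiness") as f:
        return int(f.read())

print(swappiness())   # 60 by default on many distributions, e.g. Ubuntu
```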

<8.5> Temporary files

All temporary files and log files that do not need to be persisted waste P/E cycles on the SSD. Such files can be stored in RAM using the tmpfs filesystem.

Part 6: A Summary

Basic

1,Memory cell types

A solid-state drive (SSD) is a flash-memory based data storage device.

Bits are stored into cells, which exist in three types: 1 bit per cell (single level cell, SLC); 2 bits per cell (multiple level cell, MLC); 3 bits per cell (triple-level cell, TLC).

source: section 1.1

2,Limited lifespan

Each cell has a maximum number of P/E cycles (Program/Erase), after which the cell is considered defective. This means that NAND-flash memory wears off and has a limited lifespan.

source: section 1.1

3,Benchmarking is hard

Testers are humans, therefore not all benchmarks are free of errors.

Be careful when reading the benchmarks from manufacturers or third parties, and use multiple sources before trusting any numbers. Whenever possible, run your own in-house benchmarking using the specific workload of your system, along with the specific SSD model that you want to use. Finally, make sure you look at the performance metrics that matter most for the system at hand.

source: section 2.2 and 2.3

Pages and blocks

4,NAND-flash pages and blocks

Cells are grouped into a grid, called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually, only whole blocks can be erased.

The size of a NAND-flash page can vary, and most drives have pages of size 2KB, 4KB, 8KB or 16KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256KB and 4MB. For example, the Samsung SSD 840 EVO has blocks of size 2048KB, and each block contains 256 pages of 8KB each.

source: section 3.2

5,Reads are aligned on page size

It is not possible to read less than one page at once. One can of course request just one byte from the operating system, but a full page will be retrieved in the SSD, forcing a lot more data to be read than necessary.

source: section 3.2

6,Writes are aligned on page size

When writing to an SSD, writes happen by increments of the page size. So even if a write operation affects only one byte, a whole page will be written anyway. Writing more data than necessary is known as write amplification. Writing to a page is also called “to program” a page.

source: section 3.2

7,Pages cannot be overwritten

A NAND-flash page can be written to only if it is in the "free" state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a "free" page, an operation called "read-modify-write". The data is not updated in-place, as the "free" page is a different page than the page that originally contained the data. Once the data is persisted to the drive, the original page is marked as being "stale", and will remain as such until it is erased.

source: section 3.2

8,Erases are aligned on block size

Pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages, and it is only possible to erase whole blocks at once.

source: section 3.2

SSD controller and internals

9,Flash Translation Layer

The Flash Translation Layer (FTL) is a component of the SSD controller which maps Logical Block Addresses (LBA) from the host to Physical Block Addresses (PBA) on the drive. Most recent drives implement an approach called “hybrid log-block mapping” or one of its derivatives, which works in a way that is similar to log-structured file systems. This allows random writes to be handled like sequential writes.

source: section 4.2

10,Internal parallelism

Internally, several levels of parallelism allow writing to several blocks at once in different NAND-flash chips, into what is called a "clustered block".

source: section 6

11,Wear leveling

Because NAND-flash cells are wearing off, one of the main goals of the FTL is to distribute the work among cells as evenly as possible so that blocks will reach their P/E cycle limit and wear off at the same time.

source: section 3.4

12,Garbage collection

The garbage collection process in the SSD controller ensures that “stale” pages are erased and restored into a “free” state so that the incoming write commands can be processed.

source: section 4.4

13,Background operations can affect foreground operations

Background operations such as garbage collection can impact negatively on foreground operations from the host, especially in the case of a sustained workload of small random writes.

source: section 4.4

Access patterns

14,Never write less than a page

Avoid writing chunks of data that are below the size of a NAND-flash page to minimize write amplification and prevent read-modify-write operations. The largest size for a page at the moment is 16 KB, therefore it is the value that should be used by default. This size depends on the SSD models and you may need to increase it in the future as SSDs improve.

source: section 3.2 and 3.3

15,Align writes

Align writes on the page size, and write chunks of data that are multiple of the page size.

source: section 3.2 and 3.3

16,Buffer small writes

To maximize throughput, whenever possible keep small writes into a buffer in RAM and when the buffer is full, perform a single large write to batch all the small writes.

source: section 3.2 and 3.3

17,To improve the read performance, write related data together

Read performance is a consequence of the write pattern. When a large chunk of data is written at once, it is spread across separate NAND-flash chips. Thus you should write related data in the same page, block, or clustered block, so it can later be read faster with a single I/O request, by taking advantage of the internal parallelism.

source: section 7.3

18,Separate read and write requests

A workload made of a mix of small interleaved reads and writes will prevent the internal caching and readahead mechanisms from working properly, and will cause the throughput to drop. It is best to avoid simultaneous reads and writes, and perform them one after the other in large chunks, preferably of the size of the clustered block.

For example, if 1000 files have to be updated, you could iterate over the files, doing a read and a write on one file and then moving to the next file, but that would be slow. It would be better to read all 1000 files at once and then write back to those 1000 files at once.

source: section 7.4

19,Invalidate obsolete data in batch

When some data is no longer needed or needs to be deleted, it is better to wait and invalidate it in a large batch in a single operation. This will allow the garbage collector process to handle larger areas at once and will help minimize internal fragmentation.

source: 4.4

20,Random writes are not always slower than sequential writes

If the writes are small (i.e. below the size of the clustered block), then random writes are slower than sequential writes.

If writes are both multiple of and aligned to the size of a clustered block, the random writes will use all the available levels of internal parallelism, and will perform just as well as sequential writes. For most drives, the clustered block has a size of 16 MB or 32 MB, therefore it is safe to use 32 MB.

source: section 7.2

21,A large single-threaded read is better than many small concurrent reads

Concurrent random reads cannot fully make use of the readahead mechanism. In addition, multiple Logical Block Addresses may end up on the same chip, not taking advantage of the internal parallelism. A large read operation will access sequential addresses and will therefore be able to use the readahead buffer if present, and use the internal parallelism. Consequently, if the use case allows it, it is better to issue a large read request.

source: section 7.3

22,A large single-threaded write is better than many small concurrent writes

A large single-threaded write request offers the same throughput as many small concurrent writes, however in terms of latency, a large single write has a better response time than concurrent writes. Therefore, whenever possible, it is best to perform single-threaded large writes.

source: section 7.2

23,When the writes are small and cannot be grouped or buffered, multi-threading is beneficial

Many concurrent small write requests will offer a better throughput than a single small write request. So if the I/O is small and cannot be batched, it is better to use multiple threads.

source: section 7.2

24,Split cold and hot data

Hot data is data that changes frequently, and cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data will be copied along every time the hot data is updated in a read-modify-write operation, and will be moved along during garbage collection for wear leveling. Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.

source: section 4.4

25,Buffer hot data

Extremely hot data and other high-change metadata should be buffered as much as possible and written to the drive as infrequently as possible.

source: section 4.4

System optimizations

26,PCI Express and SAS are faster than SATA

The two main host interfaces offered by manufacturers are SATA 3.0 (550 MB/s) and PCI Express 3.0 (1 GB/s per lane, using multiple lanes). Serial Attached SCSI (SAS) is also available for enterprise SSDs. In their latest versions, PCI Express and SAS are faster than SATA, but they are also more expensive.

source: 2.1

27,Over-provisioning is useful for wear leveling and performance

A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible and used by the SSD controller.

Over-provisioning helps the wear leveling mechanisms to cope with the inherent limited lifespan of NAND-flash cells. For workloads in which writes are not so heavy, 10% to 15% of over-provisioning is enough. For workloads of sustained random writes, keeping up to 25% of over-provisioning will improve performance. The over-provisioning will act as a buffer of NAND-flash blocks, helping the garbage collection process to absorb peaks of writes.

source: section 5.2

28,Enable the TRIM command

Make sure your kernel and filesystem support the TRIM command. The TRIM command notifies the SSD controller when a block is deleted. The garbage collection process can then erase blocks in the background during idle times, preparing the drive to face large write workloads.

source: section 5.1

29,Align the partition

To ensure that logical writes are truly aligned to the physical memory, you must align the partition to the NAND-flash page size of the drive.

source: section 8.1

Conclusion

This summary concludes the “Coding for SSDs” article series. I hope that I was able to convey in an understandable manner what I have learned during my personal research over solid-state drives.

If after reading this series of articles you want to go more in-depth about SSDs, a good first step would be to read some of the publications and articles linked in the reference sections of Part 2 to 5.

Another great resource is the FAST conference (the USENIX Conference on File and Storage Technologies). A lot of excellent research is being presented there every year. I highly recommend their website, a good starting point being the videos and publications for FAST 2013.