Original Link: https://www.anandtech.com/show/11210/the-intel-optane-memory-ssd-review-32gb-of-kaby-lake-caching
The Intel Optane Memory (SSD) Preview: 32GB of Kaby Lake Caching
by Billy Tallis on April 24, 2017 12:00 PM EST
Last week, we took a look at Intel's first product based on their 3D XPoint non-volatile memory technology: the Optane SSD DC P4800X, a record-breaking flagship enterprise SSD. Today Intel launches the first consumer product under the Optane brand: the Intel Optane Memory, a far smaller device at roughly one twentieth the price. Despite having "Memory" in its name, this consumer Optane Memory product is not an NVDIMM, nor is it in any other way a replacement for DRAM (those products will be coming to the enterprise market next year, even though the obvious name is now taken). Optane Memory is also not a suitable replacement for mainstream flash-based SSDs, because it is only available in 16GB and 32GB capacities. Instead, Optane Memory is Intel's latest attempt at an old idea that is great in theory but has struggled to catch on in practice: SSD caching.
Optane is Intel's brand name for products based on the 3D XPoint memory technology they co-developed with Micron. 3D XPoint is a new class of non-volatile memory that is not a variant of flash memory, the current mainstream technology for solid state drives. NAND flash memory—be it older planar NAND or newer 3D NAND flash—has fundamental limits to performance and write endurance, and many of the problems get worse as flash is shrunk to higher densities. 3D XPoint memory takes a radically different approach to non-volatile storage, and it makes different tradeoffs between density, performance, endurance and cost. Intel's initial announcement of 3D XPoint memory technology in 2015 came with general order of magnitude comparisons against existing memory technologies (DRAM and flash). Compared to NAND flash, 3D XPoint is supposed to be on the order of 1000x faster with 1000x higher write endurance. Compared to DRAM, 3D XPoint memory is supposed to be about 10x denser, which generally implies it'll be cheaper per GB by about the same amount. Those comparisons were about the raw memory itself and not about the performance of an entire SSD, and they were also projections based on memory that was still more than a year from hitting the market.
3D XPoint memory is not intended or expected to be a complete replacement for flash memory or DRAM in the foreseeable future. It offers substantially lower latency than flash memory but at a much higher price per GB. It still has finite endurance that makes it unsuitable as a drop-in replacement for DRAM without some form of wear-leveling. The natural role for 3D XPoint technology seems to be as a new tier in the memory hierarchy, slotting in between the smaller but faster DRAM and the larger but slower NAND flash. The Optane products released this month are using the first-generation 3D XPoint memory, along with first-generation controllers. Future generations should be able to offer substantial improvements to performance, endurance and capacity, but it's too soon to tell how those characteristics will scale.
The Intel Optane Memory is an M.2 NVMe SSD using 3D XPoint memory instead of NAND flash memory. 3D XPoint allows the Optane Memory to deliver far higher throughput than any flash SSD of equivalent capacity, and lower read latency than a NAND flash SSD of any capacity. The Optane Memory is intended both for OEMs to integrate into new systems and as an aftermarket upgrade for "Optane Memory ready" systems: those that meet the system requirements for Intel's new Optane caching software and have motherboard firmware support for booting from a cached volume. However, the Optane Memory can also be treated as a small and fast NVMe SSD, because all of the work to enable its caching role is performed in software or by the PCH on the motherboard. 32GB is even (barely) enough to be used as a Windows boot drive, though doing so would not be useful for most consumers.
Intel Optane Memory uses a PCIe 3.0 x2 link, while most M.2 PCIe SSDs use the full 4 lanes the connector is capable of. The two-lane link allows the Optane Memory to use the same B and M connector key positions that are used by M.2 SATA SSDs, so there's no immediate visual giveaway that Optane Memory requires PCIe connectivity from the M.2 socket. The Optane Memory is a standard 22x80mm single-sided card, but the components don't come close to using the full length. The controller chip is far smaller than a typical NVMe SSD controller, and the Optane Memory includes just one or two single-die packages of 3D XPoint memory. The Optane Memory module has labels on the front and back that contain a copper foil heatspreader layer, positioned to cool the memory rather than the controller. There is no DRAM visible on the drive.
Intel Optane Memory Specifications
Capacity | 16 GB | 32 GB
Form Factor | M.2 2280 B+M key
Interface | PCIe 3.0 x2
Protocol | NVMe 1.1
Controller | Intel
Memory | 128Gb 20nm Intel 3D XPoint
Sequential Read | 900 MB/s | 1350 MB/s
Sequential Write | 145 MB/s | 290 MB/s
Random Read | 190k IOPS | 240k IOPS
Random Write | 35k IOPS | 65k IOPS
Read Latency | 7 µs | 9 µs
Write Latency | 18 µs | 30 µs
Active Power | 3.5 W | 3.5 W
Idle Power | 1 W | 1 W
Endurance | 182.5 TB | 182.5 TB
Warranty | 5 years
MSRP | $44 | $77
The performance specifications of Intel Optane Memory have been revised slightly since the announcement last month, with Intel now providing separate performance specs for the two capacities. Given the PCIe 3.0 x2 link, it's no surprise to see that sequential read speeds are substantially lower than we see from other NVMe SSDs, at 900 MB/s for the 16GB model and 1350 MB/s for the 32GB model. Sequential writes of 145 MB/s and 290 MB/s are far slower than consumer SSDs are usually willing to advertise, but are typical of the actual sustained sequential write speed of a good TLC NAND SSD. Random read throughput of 190k and 240k IOPS is in the ballpark of other NVMe SSDs. Random write throughput of 35k and 65k IOPS is also below the peak speeds advertised by most consumer SSDs, but on par with the actual low-queue-depth performance of mainstream TLC and MLC SSDs respectively.
Really it's the latency specifications where Optane Memory shines: the read latencies of 7µs and 9µs for the 16GB and 32GB models respectively are slightly better than even the enterprise Optane SSD DC P4800X, while the write latencies of 18µs and 30µs are just 2-3 times slower. The read latencies are completely untouchable for flash-based SSDs. The write latencies can be matched by other NVMe controllers, but only because they cache write operations instead of performing them immediately.
The power consumption and endurance specifications don't look as impressive. The 3.5W active power rating is lower than that of many M.2 PCIe SSDs and low enough that thermal throttling is unlikely to be a problem. The 1W idle power, however, is a real concern. Many M.2 NVMe SSDs will idle at 1W or more if the system is not using PCIe Active State Power Management and NVMe power states, but the Optane Memory doesn't even support the latter and will apparently draw the full 1W even in a well-tuned laptop. Since these power consumption numbers will typically come on top of the power consumption of a mechanical hard drive, an Optane caching configuration is not going to offer decent power efficiency.
Meanwhile write endurance is rated at the same 100GB/day or 182.5 TB total for both capacities. Even though a stress test could burn through all of that in a week or two, 100GB/day is usually plenty for ordinary consumer use. However, a cache drive will likely experience a higher than normal write load as data and applications will tend to get evicted from the cache only to be pulled back in the next time they are loaded. More importantly, Intel promised that 3D XPoint would have on the order of 1000x the endurance of NAND flash, which should put these drives beyond the write endurance of any other consumer SSDs even after accounting for their small capacity.
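The daily and total endurance ratings are the same budget expressed two ways, matching the five-year warranty exactly:

$$100\ \mathrm{GB/day} \times 365\ \mathrm{days/year} \times 5\ \mathrm{years} = 182{,}500\ \mathrm{GB} = 182.5\ \mathrm{TB}$$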
Intel's Caching History
Intel's first attempt at using solid-state memory for caching in consumer systems was the Intel Turbo Memory, a mini-PCIe card with 1GB of flash to be used by the then-new Windows Vista features ReadyDrive and ReadyBoost. Promoted as part of the Intel Centrino platform, Turbo Memory was more or less a complete failure. The cache it provided was far too small and too slow; sequential writes in particular were much slower than a hard drive. Applications were seldom significantly faster, though in systems short on RAM, Turbo Memory made swapping less painfully slow. Battery life could sometimes be extended by allowing the hard drive to spend more time spun down at idle. Overall, most OEMs were not interested in adding more than $100 to a system for Turbo Memory.
Intel's next attempt at caching came as SSDs were moving into the mainstream consumer market. The Z68 chipset for Sandy Bridge processors added Smart Response Technology (SRT), an SSD caching mode for Intel's Rapid Storage Technology (RST) drivers. SRT could be used with any SATA SSD, but cache sizes were limited to 64GB. Intel produced the SSD 311 and later SSD 313 as caching-optimized SSDs, with low capacities but relatively high-performance SLC NAND flash. These SSDs started at $100 and had to compete against MLC SSDs that offered several times the capacity for the same price, enough that the MLC SSDs were starting to become reasonable options for use as the sole general-purpose storage, with no hard drive at all.
Smart Response Technology worked as advertised but was very unpopular with OEMs, and it didn't really catch on as an aftermarket upgrade among enthusiasts. The rapidly dropping prices and increasing capacities of SSDs made all-flash configurations more and more affordable, while SSD caching still required extra work to set up, and the small cache sizes meant heavy users would still frequently experience uncached application launches and file loads.
Intel's caching solution for Optane Memory is not simply a re-use of the existing Smart Response Technology caching feature of their Rapid Storage Technology drivers. It relies on the same NVMe remapping feature added to Skylake chipsets to support NVMe RAID, but the caching algorithms are tuned for Optane. The Optane Memory software can be downloaded and installed separately without including the rest of the RST features.
Optane Memory caching has quite a few restrictions: it is only supported with Kaby Lake processors, and it requires a 200-series chipset or one of the HM175, QM175 or CM238 mobile chipsets. Only Core i3, i5 and i7 processors are supported; Celeron and Pentium parts are excluded. Windows 10 64-bit is the only supported operating system. The Optane Memory module must be installed in an M.2 slot that connects to PCIe lanes provided by the chipset, and some motherboards will also have M.2 slots that do not support Optane caching or RST RAID. The drive being cached must be SATA, not NVMe, and only the boot volume can be cached. Lastly, the motherboard firmware must have Optane Memory support in order to boot from the cached volume. Motherboards with the necessary firmware will include a UEFI tool to unpair the Optane Memory cache device from the backing drive being cached, but this can also be done from the Windows software.
Many of these restrictions are arbitrary and enforced in software. The only genuine hardware requirement seems to be a Skylake 100-series or later chipset. The release notes for the final production release of the Optane Memory and RST drivers even list among the fixed issues the removal of the ability to enable Optane caching with a non-Optane NVMe device as the cache, and with a Skylake processor in a 200-series motherboard. Don't be surprised if these drivers get hacked to provide Optane caching on any Skylake system that can do NVMe RAID with Intel RST.
Intel's latest caching solution is not being pitched as a way of increasing performance in high-end systems; for that, they'll have full-size Optane SSDs for the prosumer market later this year. Instead, Optane Memory is intended to provide a boost for systems that still rely on a mechanical hard drive. It can be used to cache access to a SATA SSD or hybrid drive, but don't expect any OEMs to ship such a configuration—it won't be cost-effective. The goal of Optane Memory is to bring hard drive systems up to SSD levels of performance for a modest extra cost and without sacrificing total capacity.
Testing Optane Memory
For this review, Intel provided a fully-assembled desktop system with Windows 10 pre-installed and Optane Memory caching configured and enabled. The system was assembled by Intel's Demo Depot Build Center as the equivalent of a typical low to mid-range retail desktop with an i5-7400 processor, a B250 motherboard and 16GB of RAM. Storage is a 1TB 7200RPM WD Black hard drive plus the Optane Memory 32GB module.
Intel Optane Memory Review System
CPU | Intel Core i5-7400
Motherboard | ASUS B250-PLUS
Chipset | Intel B250
Memory | 2x 8GB Kingston DDR4-2400 CL17
Case | In Win C583
Power Supply | Cooler Master G550M
OS | Windows 10 64-bit, version 1607
Drivers | Intel Optane Memory version 15.5.0.1051
In addition, we tested the Optane Memory's performance and power consumption as a standalone SSD using our own testbed. This allowed us to compare against the Optane SSD DC P4800X and to verify Intel's performance specifications for the Optane Memory.
Unfortunately, this review includes only an abbreviated set of benchmarks, for two reasons: the Optane Memory review system arrived less than a week ago, as I was trying to finish up the P4800X review, and the Optane Memory module did not survive testing. After about a day of benchmarking the Optane Memory review system locked up, and after rebooting the Optane Memory module was not detected and the OS installation was corrupted beyond repair. The drive is not completely dead: Linux can detect it as an NVMe device but cannot use it for storage or even retrieve the drive's error log. In communicating with Intel over the weekend, we were not able to figure out what went wrong, and the replacement module could not be delivered before the publication of this review.
The fact that the Optane Memory module died should not be taken as any serious evidence against the product's reliability. I kill review units once every few months during the course of ordinary testing, and I was due for another failure (ed: it's a bona fide AnandTech tradition). What we call ordinary testing is of course not something that anybody would mistake for just the intended use of the product, and no SSD brand has been entirely free from this kind of problem. However, the fact remains that we don't have as much data to present as we wish, and we don't have enough experience with the product to make final conclusions about it.
For comparison with the Optane Memory caching configuration, we selected the Crucial MX300 525GB and the Samsung 960 EVO 250GB. Both of these are available at retail for slightly less than the price of the Optane Memory 32GB module and the 1TB hard drive. They represent different capacity/performance tradeoffs within the same overall storage budget and are reasonable alternatives to consider when building a system like this Optane Memory review system.
For testing of the Optane Memory caching performance and power consumption, we have SYSmark 2014 SE results. Our synthetic tests of the Optane Memory as a standalone SSD are abbreviated forms of the tests we used for the Optane SSD DC P4800X, with only queue depths up to 16 considered here. Since those tests were originally for an enterprise review, the drives are preconditioned to steady state by filling them twice over with random writes. Our follow-up testing will consider the consumer drives in more ordinary workloads consisting of short bursts of I/O on drives that are not full.
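For readers curious what that preconditioning step looks like in practice, the sketch below is a rough, simplified approximation (our actual harness uses dedicated benchmarking tools, and the device path here is hypothetical). It fills a drive twice over with 4kB writes at random offsets, which destroys all data on it and takes a very long time at these write speeds.

```python
import os, mmap, random

DEV = "/dev/nvme0n1"    # hypothetical device node; this pass destroys all data on the drive
BLOCK = 4096            # 4kB writes, matching the random write preconditioning workload

fd = os.open(DEV, os.O_WRONLY | os.O_DIRECT)   # O_DIRECT bypasses the page cache (Linux only)
capacity = os.lseek(fd, 0, os.SEEK_END)        # block devices report their size via seek-to-end
blocks = capacity // BLOCK

buf = mmap.mmap(-1, BLOCK)                     # anonymous mmap gives the aligned buffer O_DIRECT needs
buf.write(os.urandom(BLOCK))

# Two full drive writes of randomly-placed 4kB blocks push the drive to steady state,
# so later measurements reflect sustained behavior rather than fresh-out-of-box performance.
for _ in range(2 * blocks):
    os.pwritev(fd, [buf], random.randrange(blocks) * BLOCK)

os.close(fd)
```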
BAPCo SYSmark 2014 SE
BAPCo's SYSmark 2014 SE is an application-based benchmark that uses real-world applications to replay usage patterns of business users in the areas of office productivity, media creation and data/financial analysis. In addition, it also addresses the responsiveness aspect, which deals with user experience as related to application and file launches, multi-tasking, etc. Scores are meant to be compared against a reference desktop (the SYSmark 2014 SE calibration system in the graphs below). While the SYSmark 2014 benchmark used a Haswell-based desktop configuration, SYSmark 2014 SE makes the move to a Lenovo ThinkCentre M800 (Intel Core i3-6100, 4GB RAM and a 256GB SATA SSD). The calibration system scores 1000 in each of the scenarios. A score of, say, 2000, would imply that the system under test is twice as fast as the reference system.
SYSmark scores are based on total application response time as seen by the user, including not only storage latency but time spent by the processor. This means there's a limit to how much a storage improvement could possibly increase scores. It also means our Optane review system starts out with an advantage over the SYSmark calibration system due to the faster processor and more RAM.
In every performance category the Optane caching setup is either in first place or a close tie for first. The Crucial MX300 is tied with the Optane configuration in every sub-test except the responsiveness test, where it falls slightly behind. The Samsung 960 EVO 250GB struggles, partly because its low capacity, and the limited parallelism that implies, means it often cannot take advantage of the performance offered by its PCIe 3.0 x4 interface. The use of Microsoft's built-in NVMe driver instead of Samsung's may also be holding it back. As expected, the WD Black hard drive scores substantially worse than our solid-state configurations on every test, with the biggest disparity in the responsiveness test: the WD Black will force users to spend more than twice as much time waiting on their computer as they would with an SSD.
Energy Usage
SYSmark 2014 SE also adds energy measurement to the mix. A high score in the SYSmark benchmarks might be nice to have, but potential customers also need to weigh performance against the system's power consumption and efficiency. For example, in the average office scenario, it might not be worth purchasing a noisy and power-hungry PC just because it ends up with a 2000 score in the SYSmark 2014 SE benchmarks. In order to provide a balanced perspective, SYSmark 2014 SE also allows vendors and decision makers to track the energy consumption during each workload. In the graphs below, we report the total energy consumed by the PC under test for a single iteration of each SYSmark 2014 SE workload and how it compares against the calibration system.
The peak power consumption of a PCIe SSD under load can exceed the power draw of a hard drive, but over the course of a fixed workload hard drives will always be less power efficient. SSDs almost always complete the data transfer sooner, and they can enter and leave their low-power idle states far quicker. On a benchmark like SYSmark, there are no idle times long enough for a hard drive to spin down and save power.
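Since energy is power multiplied by time, a drive that draws more power under load can still finish a fixed workload with less energy. With purely illustrative numbers (not measurements from this review), an SSD peaking at 5 W that completes a transfer in 2 s uses far less energy than a 6 W hard drive that needs 20 s for the same data:

$$E_{\mathrm{SSD}} = 5\ \mathrm{W} \times 2\ \mathrm{s} = 10\ \mathrm{J} \qquad E_{\mathrm{HDD}} = 6\ \mathrm{W} \times 20\ \mathrm{s} = 120\ \mathrm{J}$$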
With an idle power of 1W, the Optane cache module substantially increases the already high power consumption of a hard drive-based configuration. It does allow the tests to complete sooner, but since the Optane module does nothing to accelerate the compute-bound portions of SYSmark, the time saved is not enough to make up the difference. It also appears that the Optane caching is not being used to enable more aggressive power saving on the hard drive; Intel is probably flushing writes from the cache often enough to keep the hard drive spinning the whole time. What this adds up to is a difference that's quite clear but not big enough for desktop users to be too concerned with unless their electricity prices are high. The Optane Memory caching configuration is the most power-hungry option we tested, while the second-fastest option, the Crucial MX300 configuration, was the most efficient, using about 16% less energy overall.
For mobile users, the power consumption of the Optane plus hard drive configuration is pretty much a deal-breaker. Our Optane review system is not optimized for power consumption the way a notebook system would be, so for a mobile user the Optane module would account for an even larger portion of the total battery draw, and battery life will take a serious hit.
Random Read
Random read speed is the most difficult performance metric for flash-based SSDs to improve on. There is very limited opportunity for a drive to do useful prefetching or caching, and parallelism from multiple dies and channels can only help at higher queue depths. The NVMe protocol reduces overhead slightly, but even high-end enterprise PCIe SSDs can struggle to offer random read throughput that would saturate a SATA link.
Real-world random reads are often blocking operations for an application, such as when traversing the filesystem to look up which logical blocks store the contents of a file. Opening even a non-fragmented file can require the OS to perform a chain of several random reads, and since each is dependent on the result of the last, they cannot be queued.
These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration.
Queue Depth 1
Our first test of random read performance looks at the dependence on transfer size. Most SSDs focus on 4kB random access as that is the most common page size for virtual memory systems and it is a common filesystem block size. For our test, each transfer size was tested for four minutes and the statistics exclude the first minute. The drives were preconditioned to steady state by filling them with 4kB random writes twice over.
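As a rough illustration of what a QD1 random read test does, the Python sketch below issues dependent 4kB reads at random offsets through O_DIRECT and reports latency percentiles. It is a simplified stand-in for our actual tooling; the device path, sample count and tested span are hypothetical, and it needs root on Linux.

```python
import os, mmap, random, time, statistics

DEV = "/dev/nvme0n1"       # hypothetical device node (read-only access, non-destructive)
BLOCK = 4096               # 4kB transfers, the most common page/filesystem block size
SPAN = 16 * 1024**3        # spread offsets across the first 16 GiB of the device
SAMPLES = 10000

fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # O_DIRECT so we measure the drive, not the page cache
buf = mmap.mmap(-1, BLOCK)                     # page-aligned buffer, as O_DIRECT requires

lat = []
for _ in range(SAMPLES):
    offset = random.randrange(SPAN // BLOCK) * BLOCK
    t0 = time.perf_counter_ns()
    os.preadv(fd, [buf], offset)               # queue depth 1: each read waits for the previous one
    lat.append(time.perf_counter_ns() - t0)
os.close(fd)

lat.sort()
print(f"mean {statistics.mean(lat)/1e3:.1f} µs, "
      f"median {lat[SAMPLES//2]/1e3:.1f} µs, "
      f"p99 {lat[int(0.99*SAMPLES)]/1e3:.1f} µs")
```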
The Optane Memory module manages to provide slightly higher performance than even the P4800X for small random reads, though it levels out at about half the P4800X's performance for larger transfers. The Samsung 960 EVO starts out about ten times slower than the Optane Memory but narrows the gap in the second half of the test. The Crucial MX300 is behind the Optane Memory by more than a factor of ten through most of the test.
Queue Depth >1
Next, we consider 4kB random read performance at queue depths greater than one. A single-threaded process is not capable of saturating the Optane SSD DC P4800X with random reads so this test is conducted with up to four threads. The queue depths of each thread are adjusted so that the queue depth seen by the SSD varies from 1 to 16. The timing is the same as for the other tests: four minutes for each tested queue depth, with the first minute excluded from the statistics.
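The split of the target queue depth across worker threads is simple; the helper below is an assumed reconstruction of the scheme described above (not our actual harness), dividing the total depth as evenly as possible over up to four threads.

```python
def split_queue_depth(total_qd, max_threads=4):
    """Spread a target queue depth across up to max_threads workers,
    each issuing asynchronous reads at its own per-thread depth."""
    threads = min(total_qd, max_threads)
    base, extra = divmod(total_qd, threads)
    return [base + (1 if i < extra else 0) for i in range(threads)]

for qd in (1, 2, 4, 8, 16):
    print(qd, split_queue_depth(qd))   # e.g. QD16 -> [4, 4, 4, 4]: four threads at QD4 each
```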
The SATA, flash NVMe and two Optane products are each clearly occupying different regimes of performance, though there is some overlap between the two Optane devices. Except at QD1, the Optane Memory offers lower throughput and higher latency than the P4800X. By QD16 the Samsung 960 EVO is able to exceed the throughput of the Optane Memory at QD1, but only with an order of magnitude more latency.
Comparing random read throughput of the Optane SSDs against the flash SSDs at low queue depths requires plotting on a log scale. The Optane Memory's lead over the Samsung 960 EVO is much larger than the 960 EVO's lead over the Crucial MX300. Even at QD16 the Optane Memory holds on to a 2x advantage over the 960 EVO and a 6x advantage over the MX300. Over the course of the test from QD1 to QD16, the Optane Memory's random read throughput roughly triples.
[Latency graphs: Mean | Median | 99th Percentile | 99.999th Percentile]
For mean and median random read latency, the two Optane drives are relatively close at low queue depths and far faster than either flash SSD. The 99th and 99.999th percentile latencies of the Samsung 960 EVO are only about twice as high as the Optane Memory while the Crucial MX300 falls further behind with outliers in excess of 20ms.
Random Write
Flash memory write operations are far slower than read operations. This is not always reflected in the performance specifications of SSDs because writes can be deferred and combined, allowing the SSD to signal completion before the data has actually moved from the drive's cache to the flash memory. Consumer workloads consist of far more reads than writes, but there are enough sources of random writes that they also matter to everyday interactive use. These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration.
Queue Depth 1
As with random reads, we first examine QD1 random write performance of different transfer sizes. 4kB is usually the most important size, but some applications will make smaller writes when the drive has a 512B sector size. Larger transfer sizes make the workload somewhat less random, reducing the amount of bookkeeping the SSD controller needs to do and generally allowing for increased performance.
As with random reads, the Optane Memory holds a slight advantage over the P4800X for the smallest transfer sizes, but the enterprise Optane drive completely blows away the consumer Optane Memory for larger transfers. The consumer flash SSDs perform quite similarly in this steady-state test and are consistently about an order of magnitude slower than the Optane Memory.
Queue Depth >1
The test of 4kB random write throughput at different queue depths is structured identically to its random read counterpart above. Queue depths from 1 to 16 are tested, with up to four threads used to generate the workload. Each tested queue depth is run for four minutes and the first minute is ignored when computing the statistics.
With the Optane SSD DC P4800X included on this graph, the two flash SSDs have barely perceptible random write throughput, and the Optane Memory's throughput and latency both fall roughly in the middle of the gap between the P4800X and the flash SSDs. The random write latency of the Optane Memory is more than twice that of the P4800X at QD1 and is close to the latency of the Samsung 960 EVO, while the Crucial MX300 starts at about twice that latency.
When testing across the range of queue depths and at steady state, the 525GB Crucial MX300 is always delivering higher throughput than the Samsung 960 EVO, but with substantial inconsistency at higher queue depths. The Optane Memory almost doubles in throughput from QD1 to QD2, and is completely flat thereafter while the P4800X continues to improve until QD8.
[Latency graphs: Mean | Median | 99th Percentile | 99.999th Percentile]
The Optane Memory and Samsung 960 EVO start out with the same median latency at QD1 and QD2 of about 20µs. The Optane Memory's latency increases linearly with queue depth after that due to its throughput being saturated, but the 960 EVO's latency stays lower until near the end of the test. The Samsung 960 EVO has relatively poor 99th percentile latency to begin with and is joined by the Crucial MX300 once it has saturated its throughput, while the Optane Memory's latency degrades gradually in the face of overwhelming queue depths. The 99.999th percentile latency of the flash-based consumer SSDs is about 300-400 times that of the Optane Memory.
Sequential Read
Sequential access is usually tested with 128kB transfers, which is large enough that requests can typically be striped across multiple controller channels while still involving a full page or more of flash on each channel. Real-world sequential transfer sizes vary widely depending on factors like which application is moving the data or how fragmented the filesystem is.
The drives were preconditioned with two full writes using 4kB random writes, so the data on each drive is entirely fragmented. This may limit how much prefetching of user data the drives can perform on the sequential read tests, but they can likely benefit from better locality of access to their internal mapping tables. These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration.
Queue Depth 1
The test of sequential read performance at different transfer sizes was conducted at queue depth 1. Each transfer size was used for four minutes, and the throughput was averaged over the final three minutes of each test segment.
The three PCIe drives show similar growth through the small to mid transfer sizes, but the Optane Memory once again has the highest performance for small transfers and higher performance across the board than the Samsung 960 EVO.
Queue Depth > 1
For testing sequential read speeds at different queue depths, we use the same overall test structure as for random reads: total queue depths of up to 64 are tested using a maximum of four threads. Each thread is reading sequentially but from a different region of the drive, so the read commands the drive receives are not entirely sorted by logical block address.
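A sketch of how such a workload can be laid out (an assumed reconstruction of the description above, not the actual tool): each worker gets its own slice of the drive and streams 128kB reads through it, so the drive sees several interleaved sequential streams rather than one perfectly sorted sequence of logical block addresses.

```python
def worker_regions(capacity_bytes, workers=4, transfer=128 * 1024):
    """Partition the drive into one contiguous region per worker; each worker
    then reads its region sequentially in transfer-sized chunks."""
    span = (capacity_bytes // workers) // transfer * transfer   # keep regions transfer-aligned
    return [(i * span, (i + 1) * span) for i in range(workers)]

# e.g. four 8 GB regions on a 32 GB device
print(worker_regions(32 * 1000**3))
```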
The Samsung 960 EVO and Optane Memory start out with QD1 sequential read performance and latency that is relatively close, but then at higher queue depths the Optane Memory jumps up to a significantly higher throughput.
The two Optane devices saturate for sequential reads at QD2, but the Optane Memory experiences a much smaller jump from its QD1 throughput. The flash SSDs are mostly saturated from the start. The Crucial MX300 delivers far lower performance than SATA allows for, due to this test being multithreaded with up to four workers reading from different parts of the drive.
[Latency graphs: Mean | Median | 99th Percentile | 99.999th Percentile]
Since all four drives are saturated through almost all of this test, the latency graphs are fairly boring: increasing queue depth increases latency. For mean and median latency the Optane Memory and the Samsung 960 EVO are relatively close, but for the 99th and 99.999th percentile metrics the 960 EVO is mostly slower than the Optane Memory by about the same factor of two that the P4800X beats the Optane Memory by.
Sequential Write
The sequential write tests are structured identically to the sequential read tests save for the direction the data is flowing. The sequential write performance of different transfer sizes is conducted with a single thread operating at queue depth 1. For testing a range of queue depths, a 128kB transfer size is used and up to four worker threads are used, each writing sequentially but to different portions of the drive. Each sub-test (transfer size or queue depth) is run for four minutes and the performance statistics ignore the first minute. These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration.
The enterprise-focused Optane SSD P4800X is slower than the consumer Optane Memory for sequential writes of less than 4kB, and even the Samsung 960 EVO beats the P4800X at 512B transfers. The 960 EVO's performance is inconsistent through the second half of the test but on average it is far closer to the MX300 than either Optane device. For larger transfers the MX300 is about a tenth the speed of the Optane Memory.
Queue Depth > 1
The sequential write throughput of the Optane SSD DC P4800X dwarfs that of the other three drives, even the Optane Memory. The Optane Memory does provide substantially higher throughput than the flash SSDs, but it does not have a latency advantage for sequential writes.
The Crucial MX300 is the only drive that does not get a throughput boost going from QD1 to QD2; as with the random write test it is not able to improve performance when the higher queue depth is due to multiple threads writing to the drive. The Samsung 960 EVO improves from the addition of a second thread but beyond that it simply gets more inconsistent. The Optane Memory and P4800X are both very consistent and saturated at QD2 after a moderate improvement from QD1.
[Latency graphs: Mean | Median | 99th Percentile | 99.999th Percentile]
The flash SSDs get more inconsistent with increased thread count and queue depth, but other than that the latency charts show the predictable growth in latency that comes from the drives all being saturated in terms of throughput.
Mixed Read/Write Performance
Workloads consisting of a mix of reads and writes can be particularly challenging for flash based SSDs. When a write operation interrupts a string of reads, it will block access to at least one flash chip for a period of time that is substantially longer than a read operation takes. This hurts the latency of any read operations that were waiting on that chip, and with enough write operations throughput can be severely impacted. If the write command triggers an erase operation on one or more flash chips, the traffic jam is many times worse.
The occasional read interrupting a string of write commands doesn't necessarily cause much of a backlog, because writes are usually buffered by the controller anyway. But depending on how much unwritten data the controller is willing to buffer and for how long, a burst of reads could force the drive to begin flushing outstanding writes before they've all been coalesced into optimally sized writes.
This mixed workload test is an extension of what Intel describes in their specifications for the Optane SSD DC P4800X. A total queue depth of 16 is achieved using four worker threads, each performing a mix of random reads and random writes. Instead of just testing a 70% read mixture, the full range from pure reads to pure writes is tested at 10% increments. These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration. Client and consumer workloads do consist of a mix of reads and writes, but never at queue depths this high; this test is included primarily for comparison between the two Optane devices.
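A minimal sketch of how such a mix can be generated, with each operation independently chosen as a read or a write according to the target ratio (the issue_read/issue_write callables are hypothetical placeholders for real 4kB random I/O, not our actual tooling):

```python
import random

def mixed_workload(read_fraction, ios, issue_read, issue_write):
    """Issue `ios` operations, each independently a read with probability read_fraction."""
    for _ in range(ios):
        if random.random() < read_fraction:
            issue_read()
        else:
            issue_write()

# Sweep from pure reads to pure writes in 10% steps, as in the test described above.
for pct_reads in range(100, -1, -10):
    mixed_workload(pct_reads / 100, ios=1000,
                   issue_read=lambda: None,    # placeholder for a 4kB random read
                   issue_write=lambda: None)   # placeholder for a 4kB random write
```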
At the beginning of the test where the workload is purely random reads, the four drives almost form a geometric progression: the Optane Memory is a little under half as fast as the P4800X and a little under twice as fast as the Samsung 960 EVO, and the MX300 is about a third as fast as the 960 EVO. As the proportion of writes increases, the flash SSDs lose throughput quickly. The Optane Memory declines across the entire test but gradually, ending up at a random write speed around one fourth of its random read speed. The P4800X has enough random write throughput to rebound during the final phases of the test, ending up with a random write throughput almost as high as the random read throughput.
[Latency graphs: Mean | Median | 99th Percentile | 99.999th Percentile]
The flash SSDs actually manage to deliver better median latency than the Optane Memory through a portion of the test, after they've shed most of their throughput. For the 99th and 99.999th percentile latencies, the flash SSDs perform much worse once writes are added to the mix, ending up almost 100 times slower than the Optane Memory.
Idle Power Consumption
There are two main ways that an NVMe SSD can save power when idle. The first is suspending the PCIe link through the Active State Power Management (ASPM) mechanism, analogous to SATA Link Power Management. Both define two power-saving modes: an intermediate mode with strict wake-up latency requirements (e.g. 10µs for the SATA "Partial" state) and a deeper state with looser wake-up requirements (e.g. 10ms for the SATA "Slumber" state). SATA Link Power Management is supported by almost all SSDs and host systems, though it is commonly off by default for desktops. PCIe ASPM support, on the other hand, is a minefield: it is common to encounter devices that do not implement it or implement it incorrectly, especially among desktops. Forcing PCIe ASPM on for a system that defaults to disabling it may lead to the system locking up.
The NVMe standard also defines a drive power management mechanism that is separate from PCIe link power management. The SSD can define up to 32 different power states and inform the host of the time taken to enter and exit these states. Some of these power states can be operational states where the drive continues to perform I/O with a restricted power budget, while others are non-operational idle states. The host system can either directly set these power states, or it can declare rules for which power states the drive may autonomously transition to after being idle for different lengths of time. NVMe power management including Autonomous Power State Transition (APST) fortunately does not depend on motherboard support the way PCIe ASPM does, so it should eventually reach the same widespread availability that SATA Link Power Management enjoys.
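One quick way to see what a drive advertises here is to dump its controller identify data with the nvme-cli tool; in recent versions the human-readable output includes one "ps N" line per supported power state, listing its maximum power and entry/exit latencies. The device path below is hypothetical and the exact output format varies by nvme-cli version.

```python
import subprocess

# List the power states a hypothetical /dev/nvme0 advertises (requires the nvme-cli package).
# Non-operational idle states are the ones the Optane Memory appears to lack entirely.
out = subprocess.run(["nvme", "id-ctrl", "/dev/nvme0"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if line.lstrip().startswith("ps "):
        print(line.strip())
```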
We report two idle power values for each drive: an active idle measurement taken with none of the above power management states engaged, and an idle power measurement with either SATA LPM Slumber state or the lowest-power NVMe non-operational power state, if supported. These tests were conducted on the Optane Memory as a standalone SSD, not in any caching configuration.
With no support for NVMe idle power states, the Optane Memory draws the rated 1W at idle while the SATA and flash-based NVMe drives drop to low power states with a tenth of the power draw or less. Even without using low power states, the Crucial MX300 uses a fraction of the power, and the Samsung 960 EVO uses only 150mW more to keep twice as many PCIe lanes connected.
The Optane Memory is a tough sell for anyone concerned with power consumption. In a typical desktop it won't be enough to worry about, but Intel definitely needs to add proper power management to the next iteration of this product.
First Thoughts
Since our Optane Memory sample died after only about a day of testing, we cannot conduct a complete analysis of the product or make any final recommendations. With that said, the early indications from the benchmarks we were able to complete are mostly very positive reflections of the performance of the Intel Optane Memory.
As a cache device, the Optane Memory brought a hard drive-based system's SYSmark scores up to the level of mainstream SSDs. These averages do not capture differences in the latency distributions of the Optane cache+hard drive configuration vs a flash SSD. In the Optane+hard drive configuration, a cache hit will be almost 1000 times faster than a cache miss, resulting in a very bimodal distribution. The flash SSDs mostly occupy the territory between the performance of Optane and of the hard drive. It's possible that a mainstream flash SSD could deliver a user experience with fewer noticeable delays than the Optane caching experience with the occasional inevitable cache miss. Overall, however, the Optane cache delivers a remarkable improvement over just a hard drive, and the 32GB cache capacity we tested is clearly large enough to be of substantial use.
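To put rough, purely illustrative numbers on that (not measurements from this review): with a ~10 µs cache hit, a ~10 ms hard drive miss, and a 95% hit rate, the occasional miss still dominates the average read latency:

$$\bar{t} \approx 0.95 \times 10\ \mu\mathrm{s} + 0.05 \times 10\ \mathrm{ms} \approx 0.51\ \mathrm{ms}$$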
As a standalone drive, the Optane Memory breaks a few records that were set by the Intel Optane SSD DC P4800X enterprise drive just last week. The Optane Memory is more tuned for small transfer sizes and offers even better QD1 random read performance. These differences seem like exactly the right optimizations to make for a drive focused on client workloads. The throughput at higher queue depths is nowhere near what the P4800X delivers and falls behind what more expensive consumer SSDs can offer, but those situations make up a very small portion of client workloads. The first and only batch of synthetic tests we were able to run on the Optane Memory were derived from the enterprise SSD tests used on the Optane SSD DC P4800X, and they cast the consumer flash SSDs in an unrealistically bad light. A typical desktop user has little reason to care how well their SSD handles multiple threads performing sustained sequential transfers on a full drive, so the Optane Memory's stellar performance there should not lead users to prefer an Optane cached hard drive setup over an all solid state configuration.
The one area where we are ready to draw some conclusions is power consumption. We still need to conduct further analysis of the Optane Memory's power use under load, but its idle power situation is simple: the Optane Memory lacks any meaningful power saving mode. It is rated for 1W at idle and that's the lowest we saw it get throughout our short time testing it. 1W is something desktop users can shrug off; a typical gaming desktop dedicates more power than that to decorative LEDs. But Optane Memory is also intended for mobile use, and the first systems announced to offer Optane Memory were Lenovo ThinkPads. Adding a minimum of 1W on top of the power drawn by a mechanical hard drive will not help battery life, no matter how much faster it makes the storage system.
With Optane Memory, Intel seems to finally have the cache device they've been needing for a decade to make SSD caching viable. It's fast in spite of its low capacity, something flash based cache devices could never pull off. Optane Memory is also more affordable at $44 and $77 than Intel's previous cache devices.
With that said, however, I wonder whether it may all be too little, too late. SSD caching has some unavoidable limitations: cold caches, cache evictions when the cache proves too small, and the added complexity of a tiered setup. With those disadvantages, Optane Memory enters a market where the price of flash SSDs means there's already very little reason for consumer machines to use a mechanical hard drive as primary storage. Instead, the best case scenario here appears to be enabling the capacity benefits of tiered storage, offering nimble systems with 1TB+ of cheap storage presented to the user as a single drive, but without as many of the drawbacks of earlier NAND-based caches.
In some sense, Optane Memory may just be a stop-gap product for the consumer market until Intel is able to deliver usefully large Optane SSDs for consumers. But those SSDs are likely to arrive with prohibitively high prices if they ship later this year as planned. 3D XPoint memory has arrived and is poised to revolutionize parts of the enterprise storage market, but it may not be ready to have a meaningful impact on the consumer market.