12-09-2016 05:35 PM
I recently purchased an M900 Tiny and installed an Intel 600p NVMe SSD in it. It immediately started showing errors in both Windows 10 and Linux 4.8, and it continued to do so with another SSD in its place, and even after Lenovo support replaced the motherboard.
It appears to be a bug in UEFI. Prior to v45 the issue did not show up under Linux 4.8, but it still did under Windows. UEFI v5F claims to have fixed this issue, but unfortunately it did not actually do so under Windows 10 or Linux 4.8.
CHANGES for FWKT5FA
- [Important] Update includes security fixes.
- Fix Windows Event Viewer reports WHEA_Logger Event ID 17 error with some kind of configurations.
- Support IPV4 and IPV6 boot when Boot Mode is set as "UEFI Only".
I spent nearly an hour trying to get phone support to route me to someone I could report the UEFI firmware issue to. After many transfers they finally said they were not able to do so and that there was nothing else they could do. This is obviously unacceptable. Why can't they report issues to the UEFI team? I work in level 3 support, and telling a customer "sorry, we can't tell someone else in our own company about a major issue" would get a serious reprimand.
Under Linux 4.8 the issue shows up like the following, repeated thousands of times per second:
Dec 6 22:53:53 ubuntu-mate kernel: [ 5.550386] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
Dec 6 22:53:53 ubuntu-mate kernel: [ 5.550403] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
Dec 6 22:53:53 ubuntu-mate kernel: [ 5.550449] pcieport 0000:00:1b.0: device [8086:a167] error status/mask=00000001/00002000
Dec 6 22:53:53 ubuntu-mate kernel: [ 5.550482] pcieport 0000:00:1b.0: [ 0] Receiver Error
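For anyone trying to quantify messages like these, here is a minimal Python sketch that counts AER corrected errors per device and decodes the status word. The bit names follow the PCIe AER Correctable Error Status register layout; the function names (`decode_bits`, `summarize`) and the regular expression are my own illustration, written against the log format shown above, and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Bit positions in the PCIe AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches the "error status/mask=" line as it appears in the log above
# (assumed format; other kernels may print it differently).
STATUS_RE = re.compile(
    r"pcieport (?P<dev>[0-9a-f:.]+): .*"
    r"error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)


def decode_bits(value):
    """Return the names of the correctable-error bits set in value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def summarize(lines):
    """Count decoded corrected errors per (device, error name)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if not match:
            continue
        status = int(match.group("status"), 16)
        for name in decode_bits(status):
            counts[(match.group("dev"), name)] += 1
    return counts


SAMPLE = [
    "Dec 6 22:53:53 ubuntu-mate kernel: [ 5.550386] pcieport 0000:00:1b.0: "
    "AER: Corrected error received: id=00d8",
    "Dec 6 22:53:53 ubuntu-mate kernel: [ 5.550449] pcieport 0000:00:1b.0: "
    "device [8086:a167] error status/mask=00000001/00002000",
]

if __name__ == "__main__":
    for (dev, name), count in summarize(SAMPLE).items():
        print(f"{dev}: {name} x{count}")
    # The mask word from the same line decodes the same way.
    print("masked:", decode_bits(0x00002000))
```

Fed the lines of a saved kernel log instead of `SAMPLE`, the same function gives a per-device tally, which makes it easier to compare error rates before and after a firmware change.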
Under Windows 10 the issue looks like the file attached, repeated tens of thousands of times per second.
12-16-2016 08:46 PM
I looked into this issue some more, and I suspect the PCIe AER errors are being triggered because the NVMe SSDs overheat in the M900 Tiny. I'm not entirely sure why they overheat in it but not in my ThinkPad Yoga 260, though, as it's even more cramped... Using the Linux 'nvme' and 'smartctl' utilities I can check the drive temperature: the 600p idles at about 60C in my M900 and at about 21C in my Yoga 260. That's a huge difference. The warning threshold for the drive is 70C.
I ordered two M.2 heatsinks off eBay from China to put on them; apparently they can lower the temperature by about 10C or so. They will take 3-4 weeks to arrive. I'll try to remember to post a follow-up here about the PCIe AER messages.
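To put the temperatures quoted above side by side, here is a small Python sketch of the headroom arithmetic. The readings and the 70C threshold are the figures from this post; the 10C heatsink improvement is only the rough estimate mentioned above, and the `headroom` helper is a name made up for illustration.

```python
WARNING_THRESHOLD_C = 70  # warning threshold quoted in the post


def headroom(temp_c, threshold_c=WARNING_THRESHOLD_C):
    """Degrees Celsius left before the drive reaches its warning threshold."""
    return threshold_c - temp_c


# Idle temperatures of the same 600p model in the two machines.
readings = {"M900 Tiny": 60, "Yoga 260": 21}

if __name__ == "__main__":
    for host, temp in readings.items():
        print(f"{host}: idle {temp}C, headroom {headroom(temp)}C")
    # Assumed heatsink effect: roughly 10C lower, per the estimate above.
    print(f"M900 Tiny with heatsink: headroom {headroom(60 - 10)}C")
```

At idle the drive in the M900 Tiny has only 10C of margin versus 49C in the Yoga 260, so any sustained load could plausibly push it past the warning threshold.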
01-04-2017 04:40 AM
Lenovo Engineering is looking at this report of a UEFI bug. They would like to know which second drive you were referring to in the first paragraph of your first post, where you mentioned that these errors also occurred with another SSD.
01-10-2017 09:13 AM
In response to customer reports of WHEA_Logger Event Viewer ID 17 errors on M900 Tiny systems, Lenovo is scheduled to release a BIOS update, BIOS v.67, by the end of January.
In lab testing, Lenovo confirmed generation of the errors on an Intel drive. These were resolved by the beta BIOS.
I will post again when the BIOS has been published.
01-18-2017 05:47 AM
Lenovo has released a BIOS update that resolves customer reports of WHEA_Logger Event Viewer ID 17 errors on M700, M800, M900, and M900x Tiny systems.
Add support for some new models of M.2 SSD on TINY.
BIOS v65 can be downloaded here: http://support.lenovo.com/us/en/downloads/DS105487
01-21-2017 11:18 PM - edited 01-21-2017 11:24 PM
I'm sorry, I somehow missed that I got a follow-up to this message.
Both drives are Intel 600p 1TB, and neither of them shows errors when installed in a ThinkPad Yoga 260. I updated to the current BIOS tonight (FWKT65A) and rebooted into Windows, and it still shows the same errors at the same rate (~16K/sec). I am attaching what the current error looks like; I believe it is essentially the same as the previous one.
Also, I should note that I finally got the M.2 heatsinks I ordered (shipped from China, so it took a while) and installed them on the drives. That dropped the temperature by 10C, so it's no longer anywhere near the warning threshold, yet it's still showing the errors. So it appears the PCIe AER errors are not related to the temperature of the drives.
01-22-2017 04:57 AM
There appears to have been a firmware update for the Intel 600p in mid-December. Do you know whether it has been applied?
This release includes a firmware update for the Intel® SSD 600p, Intel® SSD Pro 6000p, Intel® SSD E 6000p, Intel® SSD DC P3100, Intel® SSD DC S3610, and the Intel® SSD DC S3710 Series products. Users on Windows 7 or 8.1 must use the Intel® SSD Firmware Update Tool to receive the firmware update.
- For the Intel® SSD 600p Series, the latest firmware revision is 109C.
01-22-2017 01:39 PM
Yes, the drive already has the current firmware PSF109C and it did not help.
01-22-2017 05:01 PM - edited 01-22-2017 05:48 PM
Thanks for checking. I didn't anticipate a change regarding the errors, but hoped for some operational optimization that would reduce power consumption (and heat generation).
01-24-2017 11:26 PM
Tonight I was able to verify that the v67 beta resolves the problem.
Thanks BiggAl, Amy_Lenovo, and the engineering team for all the help!