these kinds of problems may only get worse if circuit geometries keep shrinking. I've wondered for a long time how modern chips, for the most part, have been pretty durable.
Yup, same. I heard that Intel used to have some kind of durability policy that their process nodes had to meet. Something like their processors having to withstand 10 years of continuous operation. Obviously, with Raptor Lake, we see that even if that policy still applied to the design of the process node, the follow-through needed to ensure the finished CPUs would hold to that standard was lacking.
I'm having trouble locating a good source on that 10-year thing, but you can see it alluded to in this plot of EM (electromigration) rates for different interconnect materials:
@thestryker, do you have any source on this "10 year" standard for their process nodes?
AFAIK, the main thing that would shorten the life of motherboards is the capacitors failing. Higher-end motherboards tend to use better caps.
Same. Also, there's been a mini arms race between PSU manufacturers on their warranty periods, for higher-end models. I've seen some with warranties as long as 12 years!
I have Noctua fans running probably 50% of the time, for over a decade, and no failures. But fans are no big deal to replace, if one dies on you.
now with these micro-connects between chips. One cosmic ray and blammo, you lose a core, or at least tolerance for highest frequency?
I'm not sure they're that sensitive. Anyway, cosmic rays are quite rare, at least at sea level.
Plus, the amount of metal it has to get through is non-trivial. I'm no expert on the subject, but if that were a frequent occurrence, it should make it difficult for manufacturers to offer a decent warranty on CPUs. Speaking of which, I wonder what the warranty is like on Xeons and EPYCs, these days.
If it's going to spread, they may have to build in these kinds of self-tests, plus dropping a core or trying a reduced clock, in a user-friendly BIOS or OS or super-tuner app or something.
Funny you should mention that, because Intel has recently added a feature called "In-Field Scan" to its server CPUs. That could be more of a necessity with high-core-count CPUs than anything else.