mysqld on db1138 was killed by the kernel due to an uncorrected hardware memory error:
[33134188.608450] mce: [Hardware Error]: Machine check events logged
[33134188.608477] mce: Uncorrected hardware memory error in user-access at 7d3c38f580
[33134188.615864] {1}Hardware error detected on CPU2
[33134188.615874] {1}event severity: recoverable
[33134188.615875] {1} Error 0, type: recoverable
[33134188.615876] {1} fru_text: B4
[33134188.615876] {1} section_type: memory error
[33134188.615877] {1} error_status: 0x0000000000000400
[33134188.615878] {1} physical_address: 0x0000007d3c38f580
[33134188.615880] {1} node: 3 card: 0 module: 0 rank: 0 bank: 1 row: 55982 column: 1016
[33134188.615882] {1} DIMM location: not present. DMI handle: 0x0000
[33134188.617181] Memory failure: 0x7d3c38f: Killing mysqld:163407 due to hardware memory corruption
[33134188.626049] Memory failure: 0x7d3c38f: recovery action for dirty LRU page: Recovered
[33134263.297543] MCE: Killing mysqld:163468 due to hardware memory corruption fault at 7feced3dc580
05/27/2020 20:20:26 Critical: "Multi-bit memory errors detected on a memory device at location(s) DIMM_B4." in SEL on db1138
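For anyone triaging similar logs, here is a rough sketch of how the relevant bits (the FRU label and the killed processes) could be pulled out of the dmesg output above. The `summarize` helper and the regexes are only illustrative and assume the message wording shown in this log; other kernel versions may phrase these lines differently.

```python
import re

# A few lines copied from the dmesg output above.
SAMPLE = """\
[33134188.615876] {1} fru_text: B4
[33134188.615880] {1} node: 3 card: 0 module: 0 rank: 0 bank: 1 row: 55982 column: 1016
[33134188.617181] Memory failure: 0x7d3c38f: Killing mysqld:163407 due to hardware memory corruption
[33134263.297543] MCE: Killing mysqld:163468 due to hardware memory corruption fault at 7feced3dc580
"""

# Patterns are assumptions based on the wording in this particular log.
FRU_RE = re.compile(r"fru_text:\s*(\S+)")
KILL_RE = re.compile(r"Killing (\S+?):(\d+) due to hardware memory corruption")


def summarize(log_text):
    """Return the FRU labels and (process, pid) pairs found in log_text."""
    frus = sorted(set(FRU_RE.findall(log_text)))
    killed = [(name, int(pid)) for name, pid in KILL_RE.findall(log_text)]
    return {"fru": frus, "killed": killed}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Here the `fru_text` value B4 lines up with the DIMM_B4 location reported in the SEL entry.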
What I have done for now is:
- Decreased buffer pool size to 300GB and restarted mysql.
Let's do a master failover on Friday to the candidate master.
@wiki_willy can we get a new DIMM for this host?