Hello,
I have been dealing with “hard restarts” every once in a while under high IOPS for a few months now. I improved cooling, checked power stability, reconfigured services to keep as much memory free as possible to prevent OOM, discovered an unstable cable/SATA port, and ran about a trillion tests. Some of the latest findings were discussed here:
https://forum.banana-pi.org/t/bpi-r4-solved-power-consumption-of-m-2-to-sata-adapters
When the SFP NAS was set up and ran for the very first time, I was able to copy 10-20 GB of data between the SSD array and the HDD array, in either direction. After that I tested and investigated the many possibilities mentioned above.
Yesterday I successfully migrated 14 TB of data from the old HDD array to the new HDD array without a single restart or even a single message in the journal. Therefore I thought that all the problems mentioned above were fixed. But in the evening another hard restart occurred, after rsync -rcn had already been running for a few hours to verify all the migrated data. This time, for the very first time, I can see something in the journal:
Feb 28 21:17:01 nas CRON[3050128]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 28 21:17:01 nas CRON[3050132]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 28 21:17:01 nas CRON[3050128]: pam_unix(cron:session): session closed for user root
Feb 28 21:17:49 nas kernel: Unable to handle kernel paging request at virtual address ffffff8112c31cc0
Feb 28 21:17:49 nas kernel: Mem abort info:
Feb 28 21:17:49 nas kernel: ESR = 0x0000000096000145
Feb 28 21:17:49 nas kernel: EC = 0x25: DABT (current EL), IL = 32 bits
Feb 28 21:17:49 nas kernel: SET = 0, FnV = 0
Feb 28 21:17:49 nas kernel: EA = 0, S1PTW = 0
Feb 28 21:17:49 nas kernel: FSC = 0x05: level 1 translation fault
Feb 28 21:17:49 nas kernel: Data abort info:
Feb 28 21:17:49 nas kernel: ISV = 0, ISS = 0x00000145, ISS2 = 0x00000000
Feb 28 21:17:49 nas kernel: CM = 1, WnR = 1, TnD = 0, TagAccess = 0
-- Boot b5da3d7a08af48cc80f3acc2520cc7a1 --
There is no other error message, and no OOM report.
Also, to be sure, there is nothing in cron.hourly that could have run:
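To double-check what the kernel printed, the ESR value from the journal can be decoded by hand. This is only a minimal sketch in Python, assuming the ARMv8-A ESR_EL1 bit layout for data aborts (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The function name `decode_esr` is my own and not part of any tool.

```python
def decode_esr(esr: int) -> dict:
    """Split an arm64 ESR value into the fields the kernel prints.

    Assumes the ISS layout used for data aborts (EC 0x24 / 0x25).
    """
    iss = esr & 0x1FFFFFF            # ISS: bits 24:0
    return {
        "EC": (esr >> 26) & 0x3F,    # exception class, bits 31:26
        "IL": (esr >> 25) & 0x1,     # 1 = 32-bit instruction
        "ISS": iss,
        "ISV": (iss >> 24) & 0x1,    # instruction syndrome valid
        "SET": (iss >> 11) & 0x3,    # synchronous error type
        "FnV": (iss >> 10) & 0x1,    # 1 = FAR not valid
        "EA": (iss >> 9) & 0x1,      # external abort type
        "CM": (iss >> 8) & 0x1,      # 1 = cache maintenance operation
        "S1PTW": (iss >> 7) & 0x1,   # stage 2 fault on stage 1 table walk
        "WnR": (iss >> 6) & 0x1,     # 1 = write, 0 = read
        "FSC": iss & 0x3F,           # fault status code
    }


# ESR value copied from the journal line above.
fields = decode_esr(0x96000145)
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

The decoded values agree with the kernel output: EC = 0x25 (data abort at the current EL) and FSC = 0x05 (level 1 translation fault). If I read the Arm documentation correctly, CM = 1 means the abort was raised by a cache maintenance or address translation instruction rather than an ordinary store, and WnR is reported as 1 in that case, but I am not sure what that implies for the root cause.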
root@nas:~# ls -l /etc/cron.hourly/
total 0
When I searched for these errors, there were many recommendations to verify the memory hardware itself using memtest or something similar. Could anyone more experienced confirm this, or does anyone have any other ideas?