Need help on how to profile cache accesses

Hello, everyone!

I am running some tests on a Banana Pi BPI-F3 with Bianbu 3.0.1 installed, and I am facing some issues when trying to profile the cache accesses of my application.

To profile the application I am using perf, but I am getting a couple of errors that I have not been able to solve. For context, here is the output of perf --version and perf list | grep "Hardware":

  • perf --version:
    perf version 6.6.63.gd5b2ef2c6af4
  • perf list | grep "Hardware":
    branch-instructions OR branches                    [Hardware event]
    branch-misses                                      [Hardware event]
    bus-cycles                                         [Hardware event]
    cache-misses                                       [Hardware event]
    cache-references                                   [Hardware event]
    cpu-cycles OR cycles                               [Hardware event]
    instructions                                       [Hardware event]
    ref-cycles                                         [Hardware event]
    stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
    stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
    mem:<addr>[/len][:access]                          [Hardware breakpoint]

I have set sysctl -w kernel.perf_event_paranoid=0. The first problem is that perf record [...] is not working: when I try to read the report, I get the error “The perf.data data has no samples!”

The second problem is that perf is not counting some events, most importantly for me the cache-references / cache-misses events, as can be seen in the following result:

# perf stat -e cycles,instructions,cache-misses,branches,branch-misses sleep 10

 Performance counter stats for 'sleep 10':

           4996191      cycles                                                                  (59.06%)
           1521949      instructions                     #    0.30  insn per cycle            
     <not counted>      cache-misses                                                            (0.00%)
     <not counted>      branches                                                                (0.00%)
     <not counted>      branch-misses                                                           (0.00%)

      10.005342635 seconds time elapsed

       0.005443000 seconds user
       0.000000000 seconds sys
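
One thing that might help narrow this down: the (59.06%) next to cycles is the fraction of the run during which that event was actually scheduled on a counter, which suggests the events are sharing counters. Below is a minimal sketch of testing each event in its own run. It assumes sleep 1 as a stand-in workload, and classify is a hypothetical helper name, not part of perf:

```shell
#!/bin/sh
# Sketch only: `sleep 1` is a stand-in workload and `classify` is a
# hypothetical helper name; neither comes from the runs shown above.

# Map the text printed by one `perf stat` run to a short verdict.
classify() {
    case "$1" in
        *"<not counted>"*)   echo "not counted" ;;
        *"<not supported>"*) echo "not supported" ;;
        *)                   echo "counted" ;;
    esac
}

# Run the checks only where perf is actually installed.
if command -v perf >/dev/null 2>&1; then
    for ev in cycles instructions cache-references cache-misses branches branch-misses; do
        # perf stat writes its counter report to stderr, so merge it into stdout.
        if out=$(perf stat -e "$ev" sleep 1 2>&1); then
            echo "$ev: $(classify "$out")"
        else
            echo "$ev: perf stat failed"
        fi
    done
fi
```

If an event is counted on its own but not in the combined run, counter sharing would be the likely cause. If it is still not counted when run alone, the event may not be available on this PMU.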

Am I doing something wrong? I would also welcome suggestions if there is a better way to profile the cache accesses of an algorithm. I should note that I am fairly new to RISC-V development.