Pipeline processor built from scratch

Posted by : on

Category : computer-architecture


repos

Current Progress

  • [Done] Pipeline processor w/ static branch prediction
  • [Done] Multi-level cache
  • [Done] Fully customizable cache
  • [Done] Branch predictors, BTB, RAS
  • [Done] Prefetchers


Statistics & Observations

Prefetchers and Predictors

Context: Default 2-level cache, Conservative prefetcher, Baseline Predictor

  • Baseline configuration
      stop time is          31500260000
      IPC: 0.236838
      L1 I Cache:    1052928 hits,        656 misses,    2111237 cycles,      8.203 penalty
      L2 I Cache:        556 hits,        100 misses,       3414 cycles,     23.020 penalty
      L1 D Cache:      73868 hits,         68 misses,     148678 cycles,     13.853 penalty
      L2 D Cache:         84 hits,         24 misses,        738 cycles,     23.750 penalty
      Predictor :    149981 misses for     211564 branch instr.
      Prefetcher:      5318 prefetches
      $finish at simulation time          31538320000
    

Context Default 2-level cache, Conservative prefetcher, Global Predictor w/ 1-Way BTB

  • Shows that 1-way BTB is better than both Baseline and and 4-way BTB on Coremark.
  • Shows that predictors can be beneficial to Coremark.
      stop time is          27107780000
      IPC: 0.275292
      L1 I Cache:     906514 hits,        603 misses,    1818190 cycles,      8.561 penalty
      L2 I Cache:        501 hits,        102 misses,       3354 cycles,     23.059 penalty
      L1 D Cache:      73868 hits,         68 misses,     148678 cycles,     13.853 penalty
      L2 D Cache:         84 hits,         24 misses,        738 cycles,     23.750 penalty
      Predictor :     76781 misses for     211564 branch instr.
      Prefetcher:      5290 prefetches
      $finish at simulation time          27143740000
    

Context: Default 2-level cache, Conservative prefetcher, Local predictor w/ 4-way BTB

  • Shows that Local predictor on Coremark is better than Baseline.
      stop time is          31109120000
      IPC: 0.239635
      L1 I Cache:    1039890 hits,        656 misses,    2085161 cycles,      8.203 penalty
      L2 I Cache:        556 hits,        100 misses,       3414 cycles,     23.020 penalty
      L1 D Cache:      73868 hits,         68 misses,     148678 cycles,     13.853 penalty
      L2 D Cache:         84 hits,         24 misses,        738 cycles,     23.750 penalty
      Predictor :    143462 misses for     211564 branch instr.
      Prefetcher:      5318 prefetches
      $finish at simulation time          31147180000
    

Context: Default 2-level cache, Conservative prefetcher, Local predictor w/ 1-way BTB

  • Shows that conservative prefetcher is worse than the case without prefetchers
  • since we have few cache misses and the prefetcher won’t positively affect the performance.
  • The degradation is due to additional cache hit latency.
      stop time is          27055390000
      IPC: 0.275790
      L1 I Cache:     901272 hits,        596 misses,    1807650 cycles,      8.567 penalty
      L2 I Cache:        495 hits,        101 misses,       3319 cycles,     23.059 penalty
      L1 D Cache:      73868 hits,         68 misses,     148678 cycles,     13.853 penalty
      L2 D Cache:         84 hits,         24 misses,        738 cycles,     23.750 penalty
      Predictor :     76782 misses for     211652 branch instr.
      Prefetcher:         0 prefetches
      $finish at simulation time          27091310000
    

Context: Default 2-level cache, Conservative prefetcher, Global predictor w/ 1-way BTB

  • Shows that Local predictor is better than Baseline but is inferior to Global predictor.
    stop time is          27831170000
    IPC: 0.268125
    L1 I Cache:     927138 hits,        593 misses,    1859346 cycles,      8.550 penalty
    L2 I Cache:        493 hits,        100 misses,       3292 cycles,     23.060 penalty
    L1 D Cache:      73868 hits,         68 misses,     148678 cycles,     13.853 penalty
    L2 D Cache:         84 hits,         24 misses,        738 cycles,     23.750 penalty
    Predictor :     89694 misses for     211652 branch instr.
    Prefetcher:         0 prefetches
    $finish at simulation time          27867030000

Parameterized Cache

Parameter search space

  • Word size: {128, 256, 512, 1024}
  • L1 cache: 16 sets, 4 ways (fixed at CP1, CP2 baseline)
  • L2 cache: {16, 32, 64, 128, 256} sets, {4, 8, 16} ways

Important Notes

  • IPCs are lower because these experiments use a commit less optimized in prefetching and branch prediction.
  • Experiments are extensive, so refer to the appendix for detailed logs.
  • Here we only select configurations with a 128KB budget for L2 cache.


Baseline: L1 cache only, 256 wordsize, 16 sets, 4 ways

    stop time is          32524050000
    IPC: 0.229373
    L1 I Cache:    1173571 hits,       1280 misses,    2361298 cycles,     11.059 penalty
    L1 D Cache:      73801 hits,        135 misses,     149582 cycles,     14.667 penalty
    $finish at simulation time          32566770000

128 wordsize, 64 sets, 16 ways

  • Miss penalty is low, but L1 cache miss rate is high.
  • Therefore, performance degrades compared to the baseline.
      stop time is          32776490000
      IPC: 0.227675
      L1 I Cache:    1173862 hits,       8476 misses,    2395276 cycles,      5.610 penalty
      L2 I Cache:       7740 hits,        736 misses,      22125 cycles,      9.029 penalty
      L1 D Cache:      73361 hits,        575 misses,     151694 cycles,      8.647 penalty
      L2 D Cache:        784 hits,        171 misses,       3247 cycles,      9.819 penalty
      $finish at simulation time          32824340000
    

256 wordsize, 128 sets, 4 ways

  • L1 I cache miss is dramatically reduced. This is the optimal configuration given the budget.
  • This advantage is gained at the expense of an exceptional number of sets.
      stop time is          32477390000
      IPC: 0.229732
      L1 I Cache:    1173378 hits,       1280 misses,    2356592 cycles,      7.684 penalty
      L2 I Cache:        900 hits,        380 misses,       5997 cycles,     11.045 penalty
      L1 D Cache:      73801 hits,        135 misses,     149261 cycles,     12.289 penalty
      L2 D Cache:        106 hits,         87 misses,       1254 cycles,     11.977 penalty
      $finish at simulation time          32520420000
    

512 wordsize, 16 sets, 16 ways OR 512 wordsize, 32 sets, 8 ways

  • Larger word size introduces heavier miss penalty, which is starting to outweigh the benefit of reducing misses.
  • So we would rather go for the fine-grained approach with smaller words and more sets.
      stop time is          33299490000
      IPC: 0.223949
      L1 I Cache:    1259468 hits,        505 misses,    2524084 cycles,     10.194 penalty
      L2 I Cache:        305 hits,        200 misses,       3634 cycles,     15.120 penalty
      L1 D Cache:      73890 hits,         46 misses,     148641 cycles,     18.717 penalty
      L2 D Cache:          2 hits,         45 misses,        723 cycles,     15.978 penalty
      $finish at simulation time          33341460000
    

1024 wordsize, 16 sets, 8 ways OR 1024 wordsize, 32 sets, 4 ways

  • Penalty is almost doubled and we are worse off!
  • Although the 17252KB Coremark instructions finally seems to fit in the L1 I cache.
      stop time is          33378390000
      IPC: 0.223395
      L1 I Cache:    1269162 hits,        139 misses,    2541183 cycles,     20.568 penalty
      L2 I Cache:         36 hits,        103 misses,       2443 cycles,     23.019 penalty
      L1 D Cache:      73912 hits,         24 misses,     148481 cycles,     27.375 penalty
      L2 D Cache:          0 hits,         24 misses,        585 cycles,     24.375 penalty
      $finish at simulation time          33419590000
    

About moomoohorse
moomoohorse

Hi I a Hao Ren. Please know me in mywebsite

Email : haor2@illinois.edu

Website : http://moomoohorse.com

About moomoohorse

Hi, my name is Hao Ren.