Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are executed on cloud virtual machines as part of the automated documentation build, so absolute timings can fluctuate.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  30.540 ns … 52.372 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     31.568 ns               GC (median):    0.00%
 Time  (mean ± σ):   31.853 ns ±  1.444 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▆█                                                   
  ▂▂▂▄███▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▁▂▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▂▂▂▂▂▂▂▂ ▃
  30.5 ns         Histogram: frequency by time          39 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained with a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl; a hand-written sketch of the kind of stencil loop the optimized implementation exploits follows the benchmark results below.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 646 evaluations per sample.
 Range (min … max):  191.054 ns … 416.351 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     201.026 ns                GC (median):    0.00%
 Time  (mean ± σ):   202.670 ns ±   6.518 ns   GC (mean ± σ):  0.00% ± 0.00%

                     ▄█▇                                       
  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇████▆▅▃▂▂▂▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  191 ns           Histogram: frequency by time          219 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
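
To see where the speed difference comes from, here is a minimal sketch of the kind of computation such a periodic operator boils down to: a fixed central-difference stencil with wrap-around indexing and no matrix bookkeeping at all. This is an illustration under the assumption of a second-order stencil, not the actual implementation of SummationByPartsOperators.jl; the helper central_difference_periodic! is made up for this example.

# Hand-written second-order central difference with periodic wrap-around.
# Just an illustration of the stencil structure; NOT the actual
# implementation used by SummationByPartsOperators.jl.
function central_difference_periodic!(du, u, h)
  N = length(u)
  @inbounds for i in 1:N
    ip = i == N ? 1 : i + 1  # right neighbor, wrapping around at the end
    im = i == 1 ? N : i - 1  # left neighbor, wrapping around at the start
    du[i] = (u[ip] - u[im]) / (2 * h)
  end
  return du
end

h = x[2] - x[1]  # uniform spacing of the periodic grid
central_difference_periodic!(du, u, h)
@show du ≈ D_SBP * u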

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  371.882 ns … 608.576 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     380.562 ns                GC (median):    0.00%
 Time  (mean ± σ):   384.895 ns ±  13.312 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▄█▆▂  ▂                                             
  ▂▂▂▄▆████▆▃▃▅███▅▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▂▂▁▂▂▃▃▃▃▃▂▂▂▂▂▃▃▃▃▂▂▂ ▃
  372 ns           Histogram: frequency by time          426 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as on periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.454 μs … 7.939 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.478 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.524 μs ± 226.949 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                   ▁▁   ▁
  ██▃▁▆▇▃▁▄▃▁▃▃▆▅▄▅▄▃▄▁▁▁▃▁▁▄▁▁▁▃▁▁▁▃▁▄▄▁▄▁▅▇▁▃▁▄▃▃▄▁▅█████ █
  4.45 μs      Histogram: log(frequency) by time      5.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl; see the band structure inspected after the benchmark results below.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.675 μs … 40.652 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.703 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.799 μs ± 646.844 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▁▇▁▁▁▁▁▄▅▄▄▄▄▃▁▃▁▃▁▁▁▃▃▃▁▁▄▄▆▆▁▃▄▁▁▁▃▃████▇▆▆▅▅▆▄▄▅▅▆▅▆ █
  6.67 μs      Histogram: log(frequency) by time      8.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
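
One plausible reason, stated here as an assumption rather than a profiled fact, is that a generic banded kernel cannot exploit that almost all rows apply the identical interior stencil, while the specialized implementation can. We can at least inspect how wide the stored bands are; bandwidths is provided by BandedMatrices.jl, and the resulting numbers depend on the interior stencil and the boundary closures.

# Number of sub- and superdiagonals stored in the banded representation.
lower, upper = bandwidths(D_banded)
@show lower, upper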

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  379.921 ns … 564.650 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.335 ns                GC (median):    0.00%
 Time  (mean ± σ):   389.635 ns ±  11.605 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▆█                                                  
  ▂▂▂▃▅████▆▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▁▁▁▁▁▁▂▁▁▂▁▁▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  380 ns           Histogram: frequency by time          430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.040 μs … 2.685 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.052 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.062 μs ± 82.468 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                        ▂
  █▃▄▃▄█▅▁▁▁▃▁▃▁▄▄▄▃▃▄▄▅▃▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆ █
  1.04 μs      Histogram: log(frequency) by time     1.75 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.275 μs … 9.221 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.310 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.360 μs ± 254.121 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█                                                   ▁▁   ▁
  ██▄▆█▇▁▁▁▁▁▅▆▄▆▄▃▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▅▁▁▁▁▁▃▃▁▁▃▁▁▁▁▅▇████ █
  5.27 μs      Histogram: log(frequency) by time      6.56 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.230 μs … 12.093 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.248 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.333 μs ± 357.525 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                   ▃                                      ▁
  █▁█▃▃▁▁▁▄▅▆▅▄▃▄▃▁▇█▃▃▃▁▄▁▁▁▁▁▃▃▃▁▁▆▁▃▁▁▁▁▄▄▇██▇▆▅▆▅▆▅▅▆▅ █
  6.23 μs      Histogram: log(frequency) by time      8.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷
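
A rough, back-of-the-envelope operation count shows what to expect: with N = 10^3 grid points, a dense matrix-vector product costs about N^2 = 10^6 multiply-adds, while the stencil-based operator only touches a fixed, small number of entries per node, i.e., on the order of 10^4 operations in total. A slowdown of roughly two orders of magnitude is therefore plausible, which matches the timings below.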

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  134.311 μs … 295.102 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     137.126 μs                GC (median):    0.00%
 Time  (mean ± σ):   138.942 μs ±   5.463 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▂▄▆██▇▅▄▂▁          ▁▃▄▄▄▂▂▁    ▁                         ▂
  ▄▇███████████▇▇▄▄▅▄▅▅████████████████▇▇▆▅▆▄▅▂▅▅▄▄▄▃▅▅▅▅▄▅▄▆ █
  134 μs        Histogram: log(frequency) by time        158 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, e.g., for arrays of StaticVectors from StaticArrays.jl and for StructArrays.jl.
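
As a quick illustration of the two layouts, consider the following toy example; the type Point2D is made up for this illustration and is not part of the benchmark code below. An array of structures stores all components of one element next to each other in memory, while a structure of arrays keeps one contiguous array per component.

using StaticArrays, StructArrays

# Hypothetical two-component type, analogous to the Vec5 defined below.
struct Point2D{T} <: FieldVector{2,T}
  x::T
  y::T
end

aos = [Point2D(1.0, 2.0), Point2D(3.0, 4.0)]  # AoS: memory order 1.0, 2.0, 3.0, 4.0
soa = StructArray(aos)                        # SoA: one array per field
@show soa.x  # [1.0, 3.0]
@show soa.y  # [2.0, 4.0]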

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower by yet another factor of several.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 988 evaluations per sample.
 Range (min … max):  47.133 ns … 80.394 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.380 ns               GC (median):    0.00%
 Time  (mean ± σ):   48.847 ns ±  2.040 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▇▇█                                                  
  ▂▂▄▆████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▁▁▁▂▁▁▁▂▁▂▁▂▂▂▂▂▃▂▃▃▂▃▂▂▂▂▂ ▃
  47.1 ns         Histogram: frequency by time        56.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 233 evaluations per sample.
 Range (min … max):  318.708 ns … 558.124 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     326.103 ns                GC (median):    0.00%
 Time  (mean ± σ):   329.477 ns ±  12.183 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▅██▇                                                
  ▁▁▁▃▅███████▆▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ ▂
  319 ns           Histogram: frequency by time          365 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.715 μs … 3.782 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.724 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.744 μs ± 115.407 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▁
  █▅▄▄▇▆▃▁▁▁▃▃▄▄▅▃▁▄▁▁▇█▃▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▃▄▁▁▁▁▁▄▅▆██ █
  1.72 μs      Histogram: log(frequency) by time      2.48 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.308 μs … 3.357 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.319 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.330 μs ± 91.767 ns   GC (mean ± σ):  0.00% ± 0.00%

  █ ▂▁▁▂▂▂▁▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▁▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
  1.31 μs        Histogram: frequency by time        2.02 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.638 μs … 5.416 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.738 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.762 μs ± 142.825 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▆                                                    
  ▂▂▃▅██▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▂▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂ ▃
  2.64 μs         Histogram: frequency by time        3.55 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.920 μs … 24.499 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.043 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.176 μs ± 790.273 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅▇  ▁                                   ▁▁                ▂
  ██▇█▅▁▁▄▄▅▅▆▄▁▁▁▁▁▁▃▁▅▇▃▁▁▃▁▁▁▁▁▃▁▁▁▃▅████▇▅▄▄▅▄▄▄▃▄▅▄▄▄ █
  8.92 μs      Histogram: log(frequency) by time      12.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
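
As a reminder of what the reinterpret step does, here is a toy example with made-up data (the 5×2 matrix B exists only for this illustration): it views the columns of a matrix of scalars as a vector of Vec5 elements without copying anything.

# View a 5×2 matrix as a vector of two Vec5 elements; no data is copied.
B = reshape(collect(1.0:10.0), 5, 2)
vB = reinterpret(reshape, Vec5{Float64}, B)
@show vB[1]  # the first column of B, seen as one Vec5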

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 545 evaluations per sample.
 Range (min … max):  209.365 ns … 278.283 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     212.361 ns                GC (median):    0.00%
 Time  (mean ± σ):   214.091 ns ±   4.970 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▆█                                                
  ▂▂▂▂▃▄████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▁▁▂▁▁▁▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  209 ns           Histogram: frequency by time          230 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 180 evaluations per sample.
 Range (min … max):  585.489 ns … 892.506 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     602.294 ns                GC (median):    0.00%
 Time  (mean ± σ):   607.225 ns ±  15.923 ns   GC (mean ± σ):  0.00% ± 0.00%

             ▃▇█▆▃▁                                            
  ▂▁▁▁▁▁▂▂▃▄▆██████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▁▁▂▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂ ▃
  585 ns           Histogram: frequency by time          654 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.256 μs … 17.372 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.439 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.519 μs ± 485.763 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▁ ▁▇                                                ▁▁  ▂
  ▇█████▇▄▆▆▄▁▁▃▃▁▄▆▆▅▄▃▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▇▁▁▁▁▁▄▁▄▆████ █
  7.26 μs      Histogram: log(frequency) by time      9.28 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 560 evaluations per sample.
 Range (min … max):  205.920 ns … 279.738 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.463 ns                GC (median):    0.00%
 Time  (mean ± σ):   210.090 ns ±   4.679 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▃██                                                    
  ▂▂▂▂▃▅████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂▁▁▁▁▁▁▁▁▁▂▂▁▂▂▂▃▄▄▃▃▂▂▂▂▂▂▂ ▃
  206 ns           Histogram: frequency by time          224 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 181 evaluations per sample.
 Range (min … max):  581.530 ns … 909.050 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     593.597 ns                GC (median):    0.00%
 Time  (mean ± σ):   598.804 ns ±  17.184 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂▅▇█▄▂                                                
  ▂▂▂▂▃▄▇███████▆▄▄▃▃▃▃▃▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  582 ns           Histogram: frequency by time          646 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.476 μs … 15.712 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.546 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.629 μs ± 440.820 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁                                             ▁▁▁    ▂
  ██▄▇█▇▁▁▁▃▃▄▄▆▆▅▅▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▃▃▄▅▇████▆▆ █
  7.48 μs      Histogram: log(frequency) by time      9.54 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case: the nonzero memory estimates and allocation counts reported below suggest that generic, allocating fallback kernels are used instead of a specialized, allocation-free loop.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 438 evaluations per sample.
 Range (min … max):  232.078 ns … 352.304 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     235.760 ns                GC (median):    0.00%
 Time  (mean ± σ):   237.593 ns ±   6.025 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▃▇█                                                
  ▂▂▂▃▅██████▆▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁▂▁▁▂▁▂▂▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  232 ns           Histogram: frequency by time          257 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  209.752 μs … 6.136 ms   GC (min … max):  0.00% … 93.84%
 Time  (median):     219.180 μs                GC (median):     0.00%
 Time  (mean ± σ):   252.502 μs ± 354.136 μs   GC (mean ± σ):  10.12% ±  6.90%

  ▃▆▅▆▃▂▁                                                      ▂
  ████████▇▅▅▆▄▅▅▅▅▄▄▄▁▄▁▁▄▃▁▁▄▃▆█▇█▇▇▆▆▅▅▄▅▄▄▄▄▃▁▄▅▄▅▅▅▅▆▆▅▆ █
  210 μs        Histogram: log(frequency) by time        462 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  176.501 μs … 5.440 ms   GC (min … max):  0.00% … 95.99%
 Time  (median):     184.715 μs                GC (median):     0.00%
 Time  (mean ± σ):   213.408 μs ± 353.301 μs   GC (mean ± σ):  11.99% ±  6.95%

  ▄▆▇█▄▄▅▃▁ ▁▁▂                                                ▂
  ██████████████▇▅▄▅▄▅▃▃▄▄▄▅▁▃▁▃▁▃▃▃▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▆▇█▇▆▆ █
  177 μs        Histogram: log(frequency) by time        319 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`