Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are run on virtual machines in the cloud as part of the automatic documentation build, so the absolute numbers can fluctuate.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

# Benchmark the in-place product mul!(du, D, u) for operator D and print the timing results
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # pause briefly so previous output is flushed before benchmarking
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)
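
Note the $ signs in @benchmark mul!($du, $D, $u): BenchmarkTools interpolates the variables into the benchmarked expression so that they are treated as local values instead of untyped globals. A minimal illustration of the difference (the exact timings are machine-dependent):

x = randn(1000)
@benchmark sum($x) # interpolated: measures only the summation itself
@benchmark sum(x)  # not interpolated: includes global-variable access overhead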

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  28.848 ns … 61.623 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.603 ns               GC (median):    0.00%
 Time  (mean ± σ):   29.872 ns ±  1.455 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▆▇                                                    
  ▂▅███▇▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▂▁▂▂▁▂▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  28.8 ns         Histogram: frequency by time        37.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 661 evaluations per sample.
 Range (min … max):  185.794 ns … 341.337 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     200.027 ns                GC (median):    0.00%
 Time  (mean ± σ):   200.027 ns ±   7.123 ns   GC (mean ± σ):  0.00% ± 0.00%

                         ▅█▄                                   
  ▁▁▁▁▁▂▂▃▄▅▅▆▅▅▅▄▃▃▃▂▂▂▆████▆▅▄▃▂▂▂▂▂▂▂▂▂▂▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  186 ns           Histogram: frequency by time          221 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
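
The gap can also be quantified programmatically with the ratio and judge utilities of BenchmarkTools. A minimal sketch, reusing the operators and vectors defined above (the quoted factor is from the run shown here and will vary between machines):

trial_SBP    = @benchmark mul!($du, $D_SBP, $u)
trial_sparse = @benchmark mul!($du, $D_sparse, $u)
ratio(median(trial_sparse), median(trial_SBP)) # time ratio, ≈ 6.8 in the run above
judge(median(trial_sparse), median(trial_SBP)) # classifies the difference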

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.79 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations per sample.
 Range (min … max):  386.218 ns … 706.574 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     389.545 ns                GC (median):    0.00%
 Time  (mean ± σ):   393.320 ns ±  13.464 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▆▅▄▂                                   ▁▃▃▃▂▂           ▂
  █████████▆▄▆▆▇█▇▆▄▄▃▁▃▁▃▁▁▁▁▁▁▁▁▃▃▁▁▁▃▄▁▁▅▇███████▇▇▇▆▆▅▅▄▆ █
  386 ns        Histogram: log(frequency) by time        437 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. No surprise: it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.473 μs … 9.303 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.511 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.561 μs ± 264.973 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                ▁▁      ▁
  ██▅▆▆▅▃▄▃▃▅▄▅▆▆▅▄▃▃▁▃▁▁▄▁▃▁▁▃▁▁▁▁▁▁▃▃▃▁▁▁▁▁▁▁▁▄▅▇███▇▇▆▆ █
  4.47 μs      Histogram: log(frequency) by time      5.74 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
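
Part of the explanation is the storage scheme: the sparse matrix stores the narrow difference stencil of every row explicitly and its generic CSC kernel works through index arrays, while SummationByPartsOperators.jl exploits the repeated stencil structure directly. A quick sketch to inspect the storage (nnz is exported by SparseArrays):

nnz(D_sparse)                     # total number of stored entries
nnz(D_sparse) / size(D_sparse, 1) # average stored entries per row, roughly the stencil width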

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.689 μs … 13.900 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.713 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.788 μs ± 406.230 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▃▅▁▁▁▃▁▃▃▄▅▅▅▄▄▁▄▁▃▁▁▁▄▃▁▄▃▁▄▁▃▁▁▁▁▁▁▁▁▃▅▇███▇▇▆▆▆▅▅▅▄▅ █
  6.69 μs      Histogram: log(frequency) by time      8.65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
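
For reference, the band structure can be inspected directly; bandwidths from BandedMatrices returns the number of sub- and super-diagonals, which here is determined by the wider boundary closures of the SBP operator rather than by the interior stencil alone:

bandwidths(D_banded) # (lower, upper) bandwidths of the banded representation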

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.79 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  385.305 ns … 623.975 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     389.054 ns                GC (median):    0.00%
 Time  (mean ± σ):   392.710 ns ±  13.231 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▆██▆▅▄▂▁                                   ▁▃▃▃▂▁▁        ▂
  ███████████▇▅▆▇█▆▆▅▅▁▅▃▁▁▁▁▁▃▃▁▁▁▃▁▁▁▁▃▃▃▁▃▅▇█████████▆▆▆▆▅ █
  385 ns        Histogram: log(frequency) by time        434 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.014 μs … 3.045 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.022 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.033 μs ± 93.334 ns   GC (mean ± σ):  0.00% ± 0.00%

  █ ▂▁▂▁▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▂
  1.01 μs        Histogram: frequency by time        1.74 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again significantly slower, by a factor of roughly five to six in this run.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.153 μs … 14.037 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.231 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.290 μs ± 312.523 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█                                              ▁▁▁     ▂
  ████▅▅▅▃▅▅▁▄▅▆▄▅▄▆▄▁▄▃▁▁▁▃▃▁▃▃▁▁▃▃▄▁▃▁▃▁▄▃▃▄▃▅▄▆██████▇▅ █
  5.15 μs      Histogram: log(frequency) by time      6.63 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.216 μs … 17.503 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.240 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.318 μs ± 413.207 ns   GC (mean ± σ):  0.00% ± 0.00%

  █    ▂              ▁                                       ▁
  █▄█▅▃▁▁▁▃▅▅▄▄▁▄▃▇█▁▃▁▅▁▁▄▁▄▄▁▃▁▃▁▃▁▁▁▁▃▆██▇▆▆▅▆▆▅▇▆▆▆▅▅▅ █
  6.22 μs      Histogram: log(frequency) by time      8.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
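
As a sketch, the median times can be extracted programmatically to quantify these factors (the quoted values are from the run shown above):

t_SBP    = median(@benchmark mul!($du, $Di_SBP, $u)).time    # median time in ns
t_sparse = median(@benchmark mul!($du, $Di_sparse, $u)).time
t_banded = median(@benchmark mul!($du, $Di_banded, $u)).time
(t_sparse / t_SBP, t_banded / t_SBP) # ≈ (5.1, 6.1) in the run above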

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  136.906 μs … 332.263 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     155.421 μs                GC (median):    0.00%
 Time  (mean ± σ):   156.552 μs ±   8.587 μs   GC (mean ± σ):  0.00% ± 0.00%

                        ▂▇█                                    
  ▂▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▆███▄▂▂▂▂▂▂▂▃▄▅▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  137 μs           Histogram: frequency by time          183 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
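
A rough operation count explains the magnitude of the slowdown: a dense matrix-vector product needs O(N²) operations, whereas the SBP, sparse, and banded versions only touch the O(N) nonzero entries. A back-of-the-envelope sketch, where the interior stencil half-width b = 4 is an illustrative assumption rather than the exact value for this operator:

N = 10^3
b = 4                         # assumed interior stencil half-width (illustrative)
dense_ops  = 2 * N^2          # multiply-adds for a dense mat-vec
banded_ops = 2 * N * (2b + 1) # multiply-adds when only the band is touched
dense_ops / banded_ops        # ≈ 111, the same order as the ≈ 150× measured above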

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.79 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, in particular for arrays with StaticVector elements from StaticArrays.jl and for StructArrays.jl.

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  45.768 ns … 92.083 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.396 ns               GC (median):    0.00%
 Time  (mean ± σ):   46.875 ns ±  1.979 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄██▃▁                                                     
  ▄█████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▂▂▂▂▂ ▃
  45.8 ns         Histogram: frequency by time        54.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 233 evaluations per sample.
 Range (min … max):  320.601 ns … 533.661 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     328.300 ns                GC (median):    0.00%
 Time  (mean ± σ):   331.773 ns ±  11.622 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▃▇█▇▄▁                                                  
  ▁▁▂▄████████▆▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁ ▂
  321 ns           Histogram: frequency by time          370 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.720 μs … 4.920 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.729 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.751 μs ± 132.154 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▃▄▁▆▅▁▁▁▁▄▁▄▃▄▄▅▄▅▃▃▄▁▁▁▁▁▁▁▄▁▁▁▄▁▃▁▃▃▃▅▁▁▁▁▁▁▃▃█▄▁▁▄▆█▇ █
  1.72 μs      Histogram: log(frequency) by time      2.49 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.311 μs … 3.463 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.321 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.335 μs ± 106.449 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▁▁▇▆▃▃▃▁▃▁▁▁▃▃▅▄▄▄▁▃▁▄▄▅▅▃▄▄▃▁▃▄▃▅▄▄▄▃▃▃▁▁▁▁▁▁▁▁▁▁▁▃▄▆▇ █
  1.31 μs      Histogram: log(frequency) by time      2.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.350 μs … 6.216 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.408 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.432 μs ± 155.242 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▄▂                                                    ▂
  ███████▇▇▆▅▁▁▁▁▃▅▃▅▄▃▅▄▁▃▁▃▄▇▇▅▃▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▃▁▃▁▃▆▇███ █
  2.35 μs      Histogram: log(frequency) by time      3.24 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  9.044 μs … 19.266 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.107 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.207 μs ± 534.849 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                        ▁
  ██▆▄██▁▁▁▃▃▁▁▃▅▅▆▅▄▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▃▃▁▁▁▁▃▁▁▁▁▃▁▃▁▇▄▅▇████ █
  9.04 μs      Histogram: log(frequency) by time      11.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 560 evaluations per sample.
 Range (min … max):  204.939 ns … 600.163 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     209.123 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.101 ns ±   6.970 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▃▆██▆▄▃                      ▁▃▄▄▃▂▁▁                 ▂
  ▃▁▁▁▅████████████▇▆▅▄▅▅▄▄▄▃▄▄▄▁▁▃▃▆█████████▆▆▆▆▅▆▅▅▅▇▄▆▅▅▄ █
  205 ns        Histogram: log(frequency) by time        231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 188 evaluations per sample.
 Range (min … max):  536.649 ns … 800.867 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     549.383 ns                GC (median):    0.00%
 Time  (mean ± σ):   554.288 ns ±  16.882 ns   GC (mean ± σ):  0.00% ± 0.00%

          ▄▇█▄▁                                                
  ▂▁▂▂▂▃▅██████▇▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▃▃▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  537 ns           Histogram: frequency by time          606 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.354 μs … 21.823 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.424 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.531 μs ± 653.292 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█   ▁                                     ▁▁              ▁
  ████▆▁▁▁▁▃▅▆▅▅▄▁▄▁▁▁▁▁▃▁▁▃▁▁▁▃▁▁▁▃▁▁▁▁▃▆▇███▆▆▅▄▃▄▄▅▃▅▃▄ █
  7.35 μs      Histogram: log(frequency) by time      9.88 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
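
The reinterpret(reshape, ...) pattern used above is a zero-copy view: the 5×N matrix of Float64 values is presented as a length-N vector of Vec5{Float64} without moving any data. A small self-contained illustration:

data = randn(5, 3)                            # 5 components at 3 grid points
v = reinterpret(reshape, Vec5{Float64}, data) # zero-copy AoS view
size(v), eltype(v)                            # ((3,), Vec5{Float64})
v[1] == Vec5(data[:, 1]...)                   # true: one Vec5 per grid point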

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 565 evaluations per sample.
 Range (min … max):  206.228 ns … 292.195 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.391 ns                GC (median):    0.00%
 Time  (mean ± σ):   210.231 ns ±   5.077 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▅██▅▄▄▃                               ▂▄▄▃▂▂            ▂
  ▄▅▅████████▇▆▆▇██▇▆▅▅▄▄▁▁▄▄▃▃▁▃▃▄▄▃▃▃▁▃▃▃██████████▆▄▆▅▅▆▆▆ █
  206 ns        Histogram: log(frequency) by time        226 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 187 evaluations per sample.
 Range (min … max):  552.000 ns … 807.021 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     561.107 ns                GC (median):    0.00%
 Time  (mean ± σ):   566.038 ns ±  15.954 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▆█▄▁                                                   
  ▂▂▃▃▆█████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▂▁▂▂▂▃▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  552 ns           Histogram: frequency by time          618 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.316 μs … 18.886 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.396 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.478 μs ± 457.395 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇   ▁                                             ▁▁    ▂
  ███▇▆██▅▃▁▃▁▁▅▅▅▄▄▄▄▄▁▁▁▁▁▁▁▁▄▃▁▁▁▃▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▄▇████▇ █
  7.32 μs      Histogram: log(frequency) by time      9.37 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case; the memory estimates in the output show that they fall back to generic code paths that allocate on every multiplication.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
 Range (min … max):  223.726 ns … 319.952 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     226.409 ns                GC (median):    0.00%
 Time  (mean ± σ):   228.417 ns ±   6.140 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▅██                                                   
  ▂▃▄█████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▂▂▂▂▁▂▂▂▃▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  224 ns           Histogram: frequency by time          249 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  213.290 μs … 5.016 ms   GC (min … max): 0.00% … 94.60%
 Time  (median):     225.563 μs                GC (median):    0.00%
 Time  (mean ± σ):   254.551 μs ± 321.663 μs   GC (mean ± σ):  9.62% ±  7.19%

   ▂▄▅█▄▄▆▅▃▂ ▁                                                ▂
  ▅███████████████▇▆▄▆▄▅▄▄▅▃▅▄▄▃▅▁▄▁▄▁▃▁▁▁▁▁▃▃▁▁▁▁▁▁▁▃▁▅▇▇▇▅▅ █
  213 μs        Histogram: log(frequency) by time        366 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  171.242 μs … 4.584 ms   GC (min … max):  0.00% … 95.38%
 Time  (median):     180.900 μs                GC (median):     0.00%
 Time  (mean ± σ):   208.233 μs ± 315.658 μs   GC (mean ± σ):  11.56% ±  7.27%

  ▃▅█▇▅▅▅▅▄▂▂▁▂▁                                               ▂
  ███████████████▆▃▅▅▄▃▄▄▃▃▁▁▄▄▁▃▃▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▁▁▁▁▁▁▄▅██▇▅ █
  171 μs        Histogram: log(frequency) by time        316 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
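
For context: a StructArray stores each component of Vec5 as its own contiguous vector, so every mul! has to gather five separate arrays per grid point, and the sparse and dense kernels additionally dispatch to generic fallbacks on this wrapper type, which is where the allocations reported above come from. The component arrays can be accessed directly:

u_soa.x1         # contiguous vector of all first components
typeof(u_soa.x1) # Vector{Float64}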

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.13
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.79 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`