Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they are run on virtual machines in the cloud as part of the automated documentation build.
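
If you want to reproduce the benchmarks on your own machine, a minimal sketch of the package setup could look as follows. This is not part of the automated build; adapt the package list to the subsections you are interested in.

using Pkg
Pkg.activate(temp = true) # throwaway environment just for the benchmarks
Pkg.add(["BenchmarkTools", "SummationByPartsOperators", "DiffEqOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])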

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, DiffEqOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_DEO = CenteredDifference(derivative_order(D_SBP), accuracy_order(D_SBP),
                           step(x), length(x)) * PeriodicBC(eltype(D_SBP))

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_DEO * u ≈ D_sparse * u

# Benchmark the in-place multiplication computing D * u and storing the result
# in du. The arguments are interpolated with $ so that BenchmarkTools does not
# measure the overhead of accessing global variables.
function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  28.345 ns … 63.435 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.453 ns               GC (median):    0.00%
 Time  (mean ± σ):   30.022 ns ±  2.528 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▆▇██▇▇▄▃▁▁                       ▁▁▁                     ▃
  ▇██████████▇▆▆▆▃▁▄▁▁▁▁▁▁▁▃▁▁▄▁▁▅▇██████▇▇▇▆▅▆▆▇▆█▇▇▇▇▇██▇ █
  28.3 ns      Histogram: log(frequency) by time      41.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.
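
To get a feeling for where the difference comes from, you can inspect how much data the sparse representation stores explicitly: in contrast to the specialized implementation, which applies the same short stencil in every row, a generic sparse matrix-vector product has to load every stored coefficient together with its index. A quick check (assuming the setup code above has been run; nnz is provided by SparseArrays) is

nnz(D_sparse)  # number of explicitly stored coefficients
size(D_sparse) # square matrix with one row per grid point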

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 656 evaluations per sample.
 Range (min … max):  187.684 ns … 357.271 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     200.055 ns                GC (median):    0.00%
 Time  (mean ± σ):   199.929 ns ±   6.474 ns   GC (mean ± σ):  0.00% ± 0.00%

                        ▂█▅▁                                   
  ▁▁▁▂▂▃▃▄▅▄▅▄▄▄▃▃▃▃▂▂▂▅████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▂
  188 ns           Histogram: frequency by time          219 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we benchmark the implementation of the same derivative operator in DiffEqOperators.jl.

doit(D_DEO, "D_DEO:", du, u)
D_DEO:
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:412 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/tIJUA/src/condense_loopset.jl:1166
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:460 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/tIJUA/src/condense_loopset.jl:1166
BenchmarkTools.Trial: 10000 samples with 115 evaluations per sample.
 Range (min … max):  761.078 ns … 73.696 μs   GC (min … max): 0.00% … 98.16%
 Time  (median):     780.330 ns               GC (median):    0.00%
 Time  (mean ± σ):   841.573 ns ±  1.696 μs   GC (mean ± σ):  4.90% ±  2.41%

  ▂▇█▆▅▄▃▂▂▃▂▃▃▃▄▄▃▁▁                                        ▂
  █████████████████████▆▇▇▇█▇▇▆▇▆▇▄▆▄▄▄▃▄▃▄▃▁▃▅▄▁▁▃▁▃▄▁▃▁▄▅▄ █
  761 ns        Histogram: log(frequency) by time       1.1 μs <

 Memory estimate: 416 bytes, allocs estimate: 6.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "DiffEqOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9fdde737] DiffEqOperators v4.45.0
  [9f78cca6] SummationByPartsOperators v0.5.75 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  382.394 ns … 680.739 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.788 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.128 ns ±  12.330 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁▆█                                                    
  ▂▃▅████▆▄▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▁▂▁▁▁▁▁▂▁▂▂▁▂▁▂▂▂▂▂▂▃▃▃▃▂▂▂▂▂▂ ▃
  382 ns           Histogram: frequency by time          428 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as for periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.539 μs … 10.710 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.581 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.628 μs ± 257.673 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                  ▁▁   ▂
  ████▅▆▃▃▅▄▄▅▅▅▆▆▅▅▄▄▃▁▁▁▃▃▁▁▃▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▅▇█████ █
  4.54 μs      Histogram: log(frequency) by time      5.68 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.
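
Part of the reason is that the boundary closures widen the stored band beyond the interior stencil, so the banded matrix-vector product touches more coefficients than the repeated interior stencil alone would suggest. A quick check (assuming the setup above; bandwidths comes from BandedMatrices.jl and accuracy_order from SummationByPartsOperators.jl) is

bandwidths(D_banded)  # (lower, upper) bandwidths, widened by the boundary closures
accuracy_order(D_SBP) # interior accuracy order 6, corresponding to a 7-point interior stencil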

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.638 μs … 15.812 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.658 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.744 μs ± 476.318 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  ███▃▅▅▄▄▄▄▆▅▅▄▃▃▄▃▄▄▄▄▃▃▁▁▃▄▁▃▁▃▁▁▁▁▁▄▆███▇▆▆▆▆▆▅▄▆▆▄▄▄▆ █
  6.64 μs      Histogram: log(frequency) by time      8.66 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.75 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  382.887 ns … 790.310 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.872 ns                GC (median):    0.00%
 Time  (mean ± σ):   392.638 ns ±  16.771 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▅▄                    ▂▃▃▂▁                             ▂
  ██████▇▇██▇▆▅▄▃▃▁▆▃▃▃▃▁▄▄██████▇█▇▆▇▅▅▆▆▁▄▄▅▅▆▆▆▆▆▆▅▅▆▆▆▆▆▆ █
  383 ns        Histogram: log(frequency) by time        464 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.011 μs … 2.318 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.020 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.031 μs ± 82.354 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▁
  █▄▄▁▅█▃▄▃▁▁▁▃▅█▅▇▇▄▅▃▄▄▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▄ █
  1.01 μs      Histogram: log(frequency) by time     1.72 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much (around an order of magnitude) slower.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.280 μs … 11.111 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.367 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.416 μs ± 260.219 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█                                                ▁▁▁▁ ▂
  ███████▇▆▅▅▅▄▆▆▅▆▄▃▁▃▁▁▃▁▁▁▁▁▃▁▃▇▆▄▁▃▃▁▁▁▁▁▁▁▁▁▁▁▅▇█████ █
  5.28 μs      Histogram: log(frequency) by time      6.62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.184 μs … 17.006 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.210 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.672 μs ± 783.935 ns   GC (mean ± σ):  0.00% ± 0.00%

            ▇       ▁                                        ▁
  ▇▇▆▄▆▅▃▆▅█▃▆▅▄▅▅██▇▆▆▆▆▆▇▇▆▇█▆▆▅▅▅▆▅▅▄▃▃▁▆▁▁▃▆▁▄▁▁▁▁▁▄▁▄▃ █
  6.18 μs      Histogram: log(frequency) by time      10.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷
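
One way to see why is to compare the memory footprints of the different representations, e.g., with Base.summarysize (the exact numbers depend on the Julia version):

Base.summarysize(Di_full)   # dense: on the order of N^2 * 8 bytes for N = 10^3
Base.summarysize(Di_sparse) # sparse: proportional to the number of stored coefficients
Base.summarysize(Di_banded) # banded: proportional to N times the bandwidth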

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  132.739 μs … 354.145 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     134.252 μs                GC (median):    0.00%
 Time  (mean ± σ):   136.271 μs ±   6.375 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇▆▃▂▁           ▁▃▄▄▂▁                                    ▁
  ▇███████▇▇▆▅▅▅▅▅▆▆██████████▇▇▇▅▄▅▄▅▄▅▄▄▅▃▅▅▆▅▅▅▄▄▅▂▂▄▄▂▃▂▄ █
  133 μs        Histogram: log(frequency) by time        159 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.75 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of

- StaticVectors from StaticArrays.jl
- StructArrays.jl

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far off.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  46.113 ns … 93.989 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.771 ns               GC (median):    0.00%
 Time  (mean ± σ):   47.569 ns ±  3.150 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇█▇▆▂                    ▂▂▂▁▁                          ▁ ▂
  █████▇▇▆▇▆▆▃▁▃▃▃▁▃▃▁▃▃▁▁██████▇▆▅▄▅▄▃▄▆▃▄▃▄▄▁▁▄▄▃▄▅▃▇▇███ █
  46.1 ns      Histogram: log(frequency) by time      62.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 235 evaluations per sample.
 Range (min … max):  319.536 ns … 618.098 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     328.187 ns                GC (median):    0.00%
 Time  (mean ± σ):   331.760 ns ±  11.915 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▅██▆▄                                               
  ▁▁▁▂▄████████▇▅▅▄▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ ▂
  320 ns           Histogram: frequency by time          369 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.677 μs … 4.461 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.687 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.709 μs ± 133.336 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▃▄▅▃▁▁▁▄▄▁▄▆▄▄▁▁▁▁▄▁▃▁▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▅▇▇██ █
  1.68 μs      Histogram: log(frequency) by time      2.46 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.
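
To see what this layout means in memory, note that Julia arrays are column-major, so the five components of one grid node are adjacent to each other while a single component is strided across the array. A small illustration (using a temporary array tmp of the same shape as u_aos_plain below):

tmp = randn(T, 5, size(D_SBP, 1))
tmp[:, 1]       # the five components of the first grid node, contiguous in memory
view(tmp, 1, :) # one component at all grid nodes, a strided (non-contiguous) view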

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.325 μs … 2.834 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.335 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.346 μs ± 90.880 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▂▂▂▂▁▁▁▂▁▁▁▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂ ▂
  1.32 μs        Histogram: frequency by time        2.06 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.339 μs … 5.801 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.406 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.430 μs ± 149.867 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▆█▃▁                                               ▁   ▂
  ▇██████▇▆▄▃▁▄▅▆▅▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▃▄▆▇████ █
  2.34 μs      Histogram: log(frequency) by time      3.25 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.747 μs … 20.752 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.810 μs                GC (median):    0.00%
 Time  (mean ± σ):   8.906 μs ± 527.785 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                        ▁
  ███▄▅▇▄▄▅▄▅▃▅▄▆▆▄▅▄▅▄▄▁▄▁▁▁▁▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▁▃▄▇████ █
  8.75 μs      Histogram: log(frequency) by time      11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
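
To see what the reinterpreted array looks like, you can check its element type and size; this creates no copy but merely a different view of the same memory. A small check (using a temporary variable v that mirrors u_aos_r defined below):

v = reinterpret(reshape, Vec5{T}, u_aos_plain)
eltype(v), size(v) # (Vec5{Float64}, (101,)): one Vec5 per grid node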

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations per sample.
 Range (min … max):  208.518 ns … 290.236 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.251 ns                GC (median):    0.00%
 Time  (mean ± σ):   213.112 ns ±   5.431 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▆█                                                     
  ▂▂▂▃▅████▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  209 ns           Histogram: frequency by time          230 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations per sample.
 Range (min … max):  537.413 ns … 913.460 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     546.370 ns                GC (median):    0.00%
 Time  (mean ± σ):   551.372 ns ±  16.449 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▅██                                                
  ▂▂▂▃▄█████▇▆▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▁▁▁▂▂▂▂▃▃▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  537 ns           Histogram: frequency by time          598 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.298 μs … 17.753 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.359 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.551 μs ± 837.267 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                  ▁▁                                      ▁
  █▅▆▇▄▃▃▃▁▃▁▁▁▁▁▄▄██▇▅▅▇▇▇▇▆▅▄▅▄▅▅▄▄▄▅▄▄▄▄▄▅▅▄▄▄▇▄▅▅▅▄▅▄▅ █
  7.3 μs       Histogram: log(frequency) by time      12.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference compared to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 535 evaluations per sample.
 Range (min … max):  211.312 ns … 293.297 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     215.228 ns                GC (median):    0.00%
 Time  (mean ± σ):   217.868 ns ±   8.096 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▁▄▆█▆▄▁▁          ▂▄▄▃▁                                    ▂
  ▇████████▇▅▅▁▄▅▅▄▅███████▇▆▄▃▁▃▄▄▅▆▇▆▄▅▅▄▄▅▄▅█▇▆▆▅▇▅▆▆▆▆▅▆█ █
  211 ns        Histogram: log(frequency) by time        257 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 179 evaluations per sample.
 Range (min … max):  590.492 ns … 1.070 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     603.034 ns               GC (median):    0.00%
 Time  (mean ± σ):   611.317 ns ± 34.349 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▁    ▁▄▄▂                                              ▂
  █████▆▄▄█████▆▅▄▄▄▅▅▇▆▅▄▄▄▅▄▁▃▄▁▃▃▄▃▃▁▁▃▁▄▃▁▃▃▁▄▁▃▃▁▁▃▄▆▆ █
  590 ns        Histogram: log(frequency) by time       834 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.381 μs … 23.820 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.444 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.540 μs ± 652.149 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█  ▁                                             ▁▁▁    ▂
  ███▃▆██▁▄▄▄▄▆▆▇▆▃▃▄▁▃▃▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▃▇████▇▆ █
  7.38 μs      Histogram: log(frequency) by time      9.36 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.
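
Concretely, a StructArray stores each field of Vec5 as its own contiguous vector, and the fields can be accessed as properties. A small illustration (using a temporary variable v_soa analogous to u_soa defined below):

v_soa = StructArray(u_aos)
typeof(v_soa.x1), length(v_soa.x1) # (Vector{Float64}, 101): each component is stored contiguously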

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 434 evaluations per sample.
 Range (min … max):  232.901 ns … 409.917 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     235.005 ns                GC (median):    0.00%
 Time  (mean ± σ):   237.994 ns ±  10.527 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅▃          ▃▄▂▁                                          ▂
  █████▇▅▅▄▄▃▄▁█████▇▇▆▅▅▁▅▅▄▆▆▆▆▅▅▄▅▁▄▁▄▅▃▄▄▃▄▅▁▃▁▁▃▄▅▄▅▅▅▅▅ █
  233 ns        Histogram: log(frequency) by time        298 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  212.800 μs … 14.505 ms   GC (min … max):  0.00% … 51.22%
 Time  (median):     221.361 μs                GC (median):     0.00%
 Time  (mean ± σ):   274.235 μs ± 477.409 μs   GC (mean ± σ):  10.73% ±  6.16%

  ▂▅█▅▅▃▂▁▁▁                              ▂▄▃▂▃▂               ▂
  ████████████▆▆▅▅▄▅▄▅▄▁▄▄▄▁▄▃▄▄▅▁▃▃▄▅▁▅▅██████▇▆▅▁▅▄▅▄▃▄▅▃▅▅ █
  213 μs        Histogram: log(frequency) by time        415 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  176.882 μs … 17.087 ms   GC (min … max):  0.00% … 39.59%
 Time  (median):     186.120 μs                GC (median):     0.00%
 Time  (mean ± σ):   234.939 μs ± 464.242 μs   GC (mean ± σ):  12.02% ±  6.21%

  ▃▅▅█▃▅▅▂▁                                           ▂▃▃▃▂▂▂▁ ▂
  █████████████▇▇▅▅▅▅▆▄▅▅▅▄▃▅▅▅▃▄▅▄▄▃▃▁▅▁▁▄▁▃▃▁▃▃▃▄▄▄████████ █
  177 μs        Histogram: log(frequency) by time        330 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.11
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.75 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`