Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are run on virtual machines in the cloud while the documentation is generated automatically, so the timings can fluctuate.
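
If you want to reproduce the benchmarks on your own machine, something along the following lines should work. This is only a sketch that installs the packages into a temporary environment; the package versions and absolute timings will of course differ from the ones shown here.

using Pkg
Pkg.activate(temp=true)  # temporary environment so that your global environment stays untouched
Pkg.add(["BenchmarkTools", "SummationByPartsOperators", "DiffEqOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])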

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, DiffEqOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_DEO = CenteredDifference(derivative_order(D_SBP), accuracy_order(D_SBP),
                           step(x), length(x)) * PeriodicBC(eltype(D_SBP))

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_DEO * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  29.501 ns … 49.196 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     30.217 ns               GC (median):    0.00%
 Time  (mean ± σ):   30.358 ns ±  1.033 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▂▆▇█▁                                                   
  ▂▄█████▇▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▁▁▂▂▁▁▂▁▁▂▁▁▂▁▁▂▁▁▁▁▁▂ ▃
  29.5 ns         Histogram: frequency by time        36.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained with a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be roughly an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 595 evaluations.
 Range (min … max):  200.087 ns … 336.138 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     203.134 ns                GC (median):    0.00%
 Time  (mean ± σ):   204.465 ns ±   6.360 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▄▆██▇▇▅▅▄▃▃▂▁▁          ▁ ▁                                ▃
  █████████████████▇▇█▇▇▇██████▇█▇▇▇▆▅▅▅▅▃▃▅▅▅▅▃▅▄▃▄▁▅▃▄▅▆▅▅▆ █
  200 ns        Histogram: log(frequency) by time        234 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
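
To put the difference into a single number, one can also compute the ratio of the measured times directly. The following sketch uses @belapsed from BenchmarkTools, which returns the minimal runtime in seconds; based on the medians shown above, it should give a ratio of roughly 200 ns / 30 ns ≈ 7 here, but the exact value will vary between runs and machines.

t_SBP    = @belapsed mul!($du, $D_SBP, $u)
t_sparse = @belapsed mul!($du, $D_sparse, $u)
t_sparse / t_SBP  # speedup factor of the specialized implementation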

Finally, we benchmark the implementation of the same derivative operator in DiffEqOperators.jl.

doit(D_DEO, "D_DEO:", du, u)
D_DEO:
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:412 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/1hqld/src/condense_loopset.jl:1166
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:460 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/1hqld/src/condense_loopset.jl:1166
BenchmarkTools.Trial: 10000 samples with 97 evaluations.
 Range (min … max):  796.124 ns … 77.474 μs   GC (min … max): 0.00% … 98.17%
 Time  (median):     812.804 ns               GC (median):    0.00%
 Time  (mean ± σ):   862.221 ns ±  1.691 μs   GC (mean ± σ):  4.35% ±  2.20%

   ▃▆▇██▇▆▅▃▃▁▂▁ ▁     ▁▃▄▃▂▁▁▂▁▁                            ▂
  ▇█████████████████▇▇████████████▇▆▇▆▅▇▆▆█▇█▇▆▆▆▆▆▆▅▆▅▅▄▄▄▄ █
  796 ns        Histogram: log(frequency) by time       959 ns <

 Memory estimate: 416 bytes, allocs estimate: 6.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "DiffEqOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9fdde737] DiffEqOperators v4.45.0
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  490.552 ns … 739.835 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     495.665 ns                GC (median):    0.00%
 Time  (mean ± σ):   497.776 ns ±  11.244 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▄▇██▅▄▃▁▁                              ▁                  ▃
  ████████████▇▆▇▇▅▅▃▃▃▃▃▃▆▅▃▅▄▄▅▁▁▅▄▄▆▇████▇▇▇▆▅▅▅▅▄▅▅▄▅▄▃▄ █
  491 ns        Histogram: log(frequency) by time        553 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as on periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.770 μs … 10.866 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.819 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.856 μs ± 272.503 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█ ▁                                                      ▁
  ██▆█▇▁▃▁▅▆▆▅▅▄▃▁▄▃▃▁▁▁▁▁▄▁▁▃▄▁▁▁▁▁▄▃▁▅▅▅▇▅▅▃▅▁▅▄▅▃▅▄▅▄▃▃▆ █
  4.77 μs      Histogram: log(frequency) by time      6.43 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.634 μs … 12.886 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.652 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.683 μs ± 279.341 ns   GC (mean ± σ):  0.00% ± 0.00%

  █    ▂                                                     ▁
  █▃▃▆█▄▁▃▁▁▁▃▁▅▅▅▄▃▃▃▁▃▁▁▃▃▁▃▃▄▁▄▃▁▁▁▁▁▃▃▁▁▁▁▁▄▁▁▁▁▁▁▁▅▃▆▆ █
  6.63 μs      Histogram: log(frequency) by time      8.21 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
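
One possible reason is that the banded matrix has to work with its maximal bandwidths in every row, and these are determined by the wide boundary closures rather than by the narrow interior stencil of the SBP operator. This is only a conjecture based on the structure of such operators; you can inspect the bandwidths yourself, e.g., as follows.

bandwidths(D_banded)  # (lower, upper) bandwidths of the banded representation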

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (min … max):  388.693 ns … 570.817 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     392.416 ns                GC (median):    0.00%
 Time  (mean ± σ):   395.171 ns ±  13.577 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▁                                                       ▂
  ████▅▆██▆▅▁▁▁▃▁▁▃▄▃▁▁▁▁▁▄▇███▇▅▅▅▅▃▄▄▄▄▄▁▃▅▄▆▆▇▅▆▆▆▅▇▅▆▅▆▆ █
  389 ns        Histogram: log(frequency) by time        473 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.010 μs … 2.523 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.019 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.023 μs ± 49.558 ns   GC (mean ± σ):  0.00% ± 0.00%

               ▃  ▆ █ ▇  ▆ ▃ ▁                               
  ▂▁▂▁▂▁▃▂▁▄▁█▁█▁▁█▁█▁█▂█▁█▁█▂▂▁▆▁▅▃▁▅▁▅▁▄▁▂▄▁▃▁▃▁▁▂▁▂▁▂▁▂ ▃
  1.01 μs        Histogram: frequency by time        1.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower (by around an order of magnitude).

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.484 μs … 14.816 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.535 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.713 μs ± 946.814 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▇▆▁▁▁▄▁▃▄▅▇▅▆▄▅▅▄▃▃▄▃▁▇▁▁▄▄▃▄▃▃▃▃▃▁▁▃▃▃▁▁▁▁▁▁▅▇▇▇▆▆▆▆▆▆▆ █
  5.48 μs      Histogram: log(frequency) by time      11.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.173 μs … 12.451 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.190 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.230 μs ± 349.310 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▄▄▆▆▃▃▄▃▃▄▁▄▆▅▃▄▄▄▄▆▇▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▆▄▄▃▁▁▁▁▁▁▁▄▄▆▄▅▅ █
  6.17 μs      Histogram: log(frequency) by time      7.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea, but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  149.539 μs … 358.838 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     150.901 μs                GC (median):    0.00%
 Time  (mean ± σ):   152.818 μs ±   7.025 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▄▂▂▁           ▂▂▂▁▂▂▁▁                                  ▂
  ████████▇▇▇▆▇▅▆▇▇▇██████████▇▆▆▄▄▅▅▅▅▆▄▆▆▆▅▆▆▅▆▅▅▄▁▃▅▅▁▅▄▄▄ █
  150 μs        Histogram: log(frequency) by time        181 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
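
Besides being slow, the dense representation also needs by far the most memory: a 10^3 × 10^3 matrix of Float64 values alone occupies about 8 * 10^6 bytes. A quick way to compare the footprints of the different representations is sketched below (the exact numbers depend on N and on the operator).

Base.summarysize(Di_full)    # dense matrix, roughly 8 MB here
Base.summarysize(Di_banded)  # banded matrix, only the bands are stored
Base.summarysize(Di_sparse)  # sparse matrix, only the nonzeros plus their indices are stored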

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, e.g., for arrays with statically sized vector elements from StaticArrays.jl (array of structures) and for StructArrays.jl containers (structure of arrays).

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower and the dense matrix representation is slower still by a large margin.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations.
 Range (min … max):  45.666 ns … 80.675 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.426 ns               GC (median):    0.00%
 Time  (mean ± σ):   46.613 ns ±  1.307 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃██                                                    
  ▃▇███▇▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▁▁▁▂▂▁▂▁▂▁▁▂▁▁▂▁▁▂▁▁▂▂▁▂▂▂▂ ▃
  45.7 ns         Histogram: frequency by time        54.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 221 evaluations.
 Range (min … max):  333.561 ns … 559.005 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     340.814 ns                GC (median):    0.00%
 Time  (mean ± σ):   342.608 ns ±   9.594 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▂▇█▄▁                                                  
  ▂▂▂▂▃▆█████▆▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  334 ns           Histogram: frequency by time          380 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.714 μs … 5.051 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.724 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.784 μs ± 211.326 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▅▄█▅▁▃█▁▃▅▅▄▄▃▁▁▁▁▄▃▁▁▁▁▄▁▃▇▇▇█▇█▇▇▇▇▇█▇▆▆▆▆▆▆▆▇▆▇▇▇▇▇▇█ █
  1.71 μs      Histogram: log(frequency) by time      2.65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.351 μs … 4.107 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.367 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.371 μs ± 65.886 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▅▁█                                              
  ▂▂▂▄▄█▆████▆▄▆▃▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▂▂ ▃
  1.35 μs        Histogram: frequency by time        1.44 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.366 μs … 5.121 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.412 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.428 μs ± 134.533 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆█▄▁                                                     ▂
  ████████▆▃▃▁▁▄▄▅▆▅▃▁▄▃▃▁▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▃▁▁▄▁▁▁▁▁▄▄▃▁▁▄▄▅▄ █
  2.37 μs      Histogram: log(frequency) by time      3.28 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.833 μs … 17.282 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.957 μs                GC (median):    0.00%
 Time  (mean ± σ):   8.999 μs ± 394.272 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▁█    ▁                                                  ▂
  ███▅▅▄██▄▃▁▃▃▁▁▃▄▆▆▄▃▄▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▆ █
  8.83 μs      Histogram: log(frequency) by time      11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations.
 Range (min … max):  209.773 ns … 259.193 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.649 ns                GC (median):    0.00%
 Time  (mean ± σ):   212.314 ns ±   3.183 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃▇██▄▃▂▁                                       ▁        ▂
  ▄▆▇█████████▇▅▇██▇▆▅▅▄▃▄▁▁▁▃▃▁▁▁▁▃▁▅▄▃▄▃▅▄▃▅▁▁▁▅▆████▇▇▆▇▇ █
  210 ns        Histogram: log(frequency) by time        227 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations.
 Range (min … max):  538.201 ns … 810.503 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     549.116 ns                GC (median):    0.00%
 Time  (mean ± σ):   551.575 ns ±  13.884 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▅██                                                   
  ▂▂▂▃▅████▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  538 ns           Histogram: frequency by time          616 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.088 μs … 16.065 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.133 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.193 μs ± 486.662 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▂
  █▅▆█▆▄▁▃▄▄▆▆▄▄▃▄▁▁▁▁▃▄▁▃▁▁▃▁▁▁▁▁▁▁▁▃▄▃▁▃▁▁▄▄▅▆▅▅▆▅▅▃▅▄▄▃▄ █
  7.09 μs      Histogram: log(frequency) by time      9.62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
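
Note that the reinterpreted array shares its memory with u_aos_plain, so no data is copied. A minimal sketch to convince yourself of this (using a copy of the data so that the arrays used in the benchmarks stay unchanged):

u_check   = copy(u_aos_plain)
u_check_r = reinterpret(reshape, Vec5{T}, u_check)
u_check_r[1] = Vec5{T}(1, 2, 3, 4, 5)  # writes into the first column of u_check
u_check[:, 1]  # now equal to [1.0, 2.0, 3.0, 4.0, 5.0]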

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 565 evaluations.
 Range (min … max):  206.384 ns … 268.359 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     209.949 ns                GC (median):    0.00%
 Time  (mean ± σ):   210.685 ns ±   3.805 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▃▆▇█▄▃▁                          ▁▁                   ▂
  ▆▇▅▇█████████▇█▇▇▅▆▃▄▃▅▃▃▁▁▃▆▄▃▄▄▄▄▅▆████▇▆▆▅▆▅▆▅▄▁▅▄▃▄▄▄▅ █
  206 ns        Histogram: log(frequency) by time        231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 187 evaluations.
 Range (min … max):  550.273 ns … 790.353 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     561.529 ns                GC (median):    0.00%
 Time  (mean ± σ):   563.814 ns ±  11.397 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂▅██                                                
  ▂▁▂▂▃▄▇████▇▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  550 ns           Histogram: frequency by time          613 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.073 μs … 15.489 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.126 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.175 μs ± 461.512 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█                                                        ▁
  ██▁▄▇█▆▁▃▄▃▁▃▅▆▅▅▄▃▄▁▁▇▃▁▁▁▁▁▁▁▁▁▁▁▃▁▄▁▁▃▁▁▃▃▁▁▃▁▁▁▁▁▃▃▁▆ █
  7.07 μs      Histogram: log(frequency) by time      9.07 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 438 evaluations.
 Range (min … max):  233.128 ns … 376.205 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     235.279 ns                GC (median):    0.00%
 Time  (mean ± σ):   236.995 ns ±   9.144 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█▁▁           ▁▁                                          ▂
  ████▇▅▅▄▃▃▃▄▃▆████▆▆▄▅▄▃▁▁▄▃▄▆▄▆▄▄▄▁▄▄▄▄▄▄▁▄▁▄▄▃▁▃▃▅▅▆▅▅▅▅ █
  233 ns        Histogram: log(frequency) by time        297 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  214.390 μs … 8.182 ms   GC (min … max):  0.00% … 96.13%
 Time  (median):     222.936 μs                GC (median):     0.00%
 Time  (mean ± σ):   272.559 μs ± 455.235 μs   GC (mean ± σ):  10.67% ±  6.19%

  ▂▆█▃▃▃▁                                   ▃▄▃▁▁              ▂
  ██████████▇▆▅▅▅▄▅▄▆▅▄▃▅▄▄▃▃▃▁▃▄▄▄▁▁▃▃▃▁▁▅██████▇▆▅▆▅▅▅▅▅▅▃▆ █
  214 μs        Histogram: log(frequency) by time        406 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  171.219 μs … 8.404 ms   GC (min … max):  0.00% … 96.18%
 Time  (median):     180.105 μs                GC (median):     0.00%
 Time  (mean ± σ):   227.064 μs ± 445.609 μs   GC (mean ± σ):  12.53% ±  6.21%

  ▂▅▆█▃▁▃▂                                            ▃▄▃▂▁▁   ▂
  █████████▇▇▇▇▇▅▅▄▅▆▄▅▄▄▅▄▅▄▅▃▅▄▄▁▁▃▃▄▁▅▁▁▃▄▃▁▄▃▁▅▇███████▇▇ █
  171 μs        Histogram: log(frequency) by time        324 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.3
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`