Benchmarks
Here are some simple benchmarks. Take them with a grain of salt: they are run on cloud virtual machines as part of the automated documentation build.
First-derivative operators
Periodic domains
Let's set up some benchmark code.
using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators
BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair
T = Float64
xmin, xmax = T(0), T(1)
D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_sparse = sparse(D_SBP)
u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u
function doit(D, text, du, u)
    println(text)
    sleep(0.1)
    show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
    println()
end
doit (generic function with 1 method)
First, we benchmark the implementation from SummationByPartsOperators.jl.
doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
Range (min … max): 27.186 ns … 50.979 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.860 ns ┊ GC (median): 0.00%
Time (mean ± σ): 28.212 ns ± 1.757 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇█▅▄▁
▅██████▆▄▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▂▁▁▁▂▁▁▂▁▁▁▁▂▂▁▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
27.2 ns Histogram: frequency by time 37.1 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Next, we compare this to the runtime obtained with a sparse matrix representation of the derivative operator. Depending on the hardware, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.
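To quantify the gap on your own machine with a single number, one can compare the minimum runtimes measured by BenchmarkTools directly. The following is a sketch under the same setup as above; the resulting factor depends strongly on the hardware.

```julia
# Sketch: measure the speedup of the optimized operator over its sparse
# matrix representation. Exact numbers are hardware-dependent.
using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=0.0, xmax=1.0, N=100)
D_sparse = sparse(D_SBP)
u = randn(eltype(D_SBP), length(grid(D_SBP))); du = similar(u)

t_sbp    = @belapsed mul!($du, $D_SBP, $u)     # optimized implementation
t_sparse = @belapsed mul!($du, $D_sparse, $u)  # sparse matrix-vector product
println("speedup factor: ", t_sparse / t_sbp)
```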
doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 600 evaluations per sample.
Range (min … max): 198.302 ns … 396.888 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 209.305 ns ┊ GC (median): 0.00%
Time (mean ± σ): 211.130 ns ± 8.840 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▆▆█▆▇█▆▅▃▂▂
▁▁▁▁▁▂▂▄▅▆▇▇██████████████▆▆▄▄▃▃▃▂▂▂▂▃▃▃▃▃▄▃▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁ ▃
198 ns Histogram: frequency by time 232 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
These results were obtained using the following versions.
using InteractiveUtils
versioninfo()
using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × AMD EPYC 7763 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
[9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`
Bounded domains
We start again by setting up some benchmark code.
using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices
BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair
T = Float64
xmin, xmax = T(0), T(1)
D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)
u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u
function doit(D, text, du, u)
    println(text)
    sleep(0.1)
    show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
    println()
end
doit (generic function with 1 method)
First, we benchmark the implementation from SummationByPartsOperators.jl.
doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
Range (min … max): 382.685 ns … 603.685 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 386.281 ns ┊ GC (median): 0.00%
Time (mean ± σ): 390.251 ns ± 13.838 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▅▇██▇▆▄▃▂▁ ▂▂▃▃▂▁ ▂
████████████▇▇█▇▇▅▅▄▄▁▃▁▁▁▃▃▁▃▁▁▁▁▁▁▁▃▁▃▁▁▃▄▄▄▅▃▅▇████████▆▆▆ █
383 ns Histogram: log(frequency) by time 435 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, as on periodic domains.
doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
Range (min … max): 4.488 μs … 8.783 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.584 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.639 μs ± 285.829 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇██▆▃ ▁ ▁▁▁ ▂
▇█████████▆▄▃▅▅▆▆▁▄▅▄▃▁▃▁▁▃▃▁▁▁▄▁▁▁▁▁▁▁▃▁▄▃▃▃▁▁▄▁▁▄▇▇█████▆ █
4.49 μs Histogram: log(frequency) by time 5.92 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.
doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 9.507 μs … 76.666 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.537 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.693 μs ± 1.282 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁ ▁
█▅▄▁▄▅▄▁▃▃▁▁▄▄▁▄█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃█▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅ █
9.51 μs Histogram: log(frequency) by time 18 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
These results were obtained using the following versions.
using InteractiveUtils
versioninfo()
using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × AMD EPYC 7763 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
[aae01518] BandedMatrices v1.11.0
[9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`
Dissipation operators
We follow the same structure as before. At first, we set up some benchmark code.
using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices
BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair
T = Float64
xmin, xmax = T(0), T(1)
D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full = Matrix(Di_SBP)
u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u
function doit(D, text, du, u)
    println(text)
    sleep(0.1)
    show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
    println()
end
doit (generic function with 1 method)
At first, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.
doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations per sample.
Range (min … max): 386.411 ns … 779.322 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 389.436 ns ┊ GC (median): 0.00%
Time (mean ± σ): 393.478 ns ± 14.537 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
██▃
▃████▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▃▂▂▂▂▂▂▂ ▃
386 ns Histogram: frequency by time 439 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 14 evaluations per sample.
Range (min … max): 986.786 ns … 20.332 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 996.857 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.022 μs ± 327.317 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▁
██▆▃▄▄█▄▁▄▁▁▆▇▇▅▅▃▄▅▃▃▃▅▆▆▅▆▅▃▃▃▁▃▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆█▇ █
987 ns Histogram: log(frequency) by time 1.62 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower (by around an order of magnitude).
doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
Range (min … max): 4.859 μs … 10.894 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.902 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.960 μs ± 299.584 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▇▂ ▁▁▁ ▁
████▃▅█▇▄▁▃▁▆▆▆▃▄▃▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇████▆ █
4.86 μs Histogram: log(frequency) by time 6.23 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
Range (min … max): 6.883 μs … 16.070 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.909 μs ┊ GC (median): 0.00%
Time (mean ± σ): 7.004 μs ± 528.775 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅ ▁▁ ▁
██▅▁▅▇▁▁▁▁▃▆▆▅▄▁▃▁▁▄▁▁▁▁▁▁▃▁▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▃▁▁▃▇████▇▆▆▅▅▃ █
6.88 μs Histogram: log(frequency) by time 8.91 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea but 🤷
doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 115.094 μs … 291.032 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 116.486 μs ┊ GC (median): 0.00%
Time (mean ± σ): 119.118 μs ± 7.947 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▇▅▃▂▂▂▁ ▁▃▃▃▂▁▁ ▁
████████████▆▇▆▆▆▆▆▇▆▇▆▇██████████▇▆▇▆▅▅▆▅▅▄▄▃▅▄▄▅▆▅▄▅▄▅▂▅▄▂▄ █
115 μs Histogram: log(frequency) by time 143 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
These results were obtained using the following versions.
using InteractiveUtils
versioninfo()
using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × AMD EPYC 7763 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
[aae01518] BandedMatrices v1.11.0
[9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`
Structure-of-Arrays (SoA) and Array-of-Structures (AoS)
SummationByPartsOperators.jl tries to provide efficient support of
- StaticVectors from StaticArrays.jl
- StructArrays.jl
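As a minimal toy sketch of the two memory layouts (not part of the benchmarks below): an array of structures interleaves the components of each element in memory, while a structure of arrays keeps one contiguous array per component.

```julia
# Toy sketch of AoS vs. SoA; Vec2 is a hypothetical two-component FieldVector.
using StaticArrays, StructArrays

struct Vec2{T} <: FieldVector{2,T}
    x::T
    y::T
end

aos = [Vec2(1.0, 2.0), Vec2(3.0, 4.0)]  # AoS: x and y values interleaved
soa = StructArray(aos)                   # SoA: one plain array per field
soa.x  # contiguous array of all x components
soa.y  # contiguous array of all y components
```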
To demonstrate this, let us set up some benchmark code.
using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators
BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair
struct Vec5{T} <: FieldVector{5,T}
x1::T
x2::T
x3::T
x4::T
x5::T
end
# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
    for i in 1:size(du, 1)
        mul!(view(du, i, :), D, view(u, i, :), args...)
    end
end
T = Float64
xmin, xmax = T(0), T(1)
D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full = Matrix(D_SBP)
101×101 Matrix{Float64}:
-141.176 173.529 -23.5294 … 0.0 0.0 0.0
-50.0 0.0 50.0 0.0 0.0 0.0
9.30233 -68.6047 0.0 0.0 0.0 0.0
3.06122 0.0 -60.2041 0.0 0.0 0.0
0.0 0.0 8.33333 0.0 0.0 0.0
0.0 0.0 0.0 … 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
⋮ ⋱ ⋮
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 … 0.0 0.0 0.0
0.0 0.0 0.0 -8.33333 0.0 0.0
0.0 0.0 0.0 60.2041 0.0 -3.06122
0.0 0.0 0.0 0.0 68.6047 -9.30233
0.0 0.0 0.0 -50.0 0.0 50.0
0.0 0.0 0.0 … 23.5294 -173.529 141.176
At first, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far slower still.
println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
Range (min … max): 46.214 ns … 91.708 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 47.470 ns ┊ GC (median): 0.00%
Time (mean ± σ): 48.096 ns ± 2.592 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▇███▇▇▆▅▄▃▂▁ ▂▂▁▂▁ ▃
█████████████████▇▇▅▆▇▆█▆▄▃▅▃▁▃▄▁▁▃▁▁▁▃▁▁▁▄▅█████████▇▇▇▆▆▆ █
46.2 ns Histogram: log(frequency) by time 58.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 230 evaluations per sample.
Range (min … max): 321.687 ns … 678.613 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 376.265 ns ┊ GC (median): 0.00%
Time (mean ± σ): 379.058 ns ± 23.571 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▄▅▇██▅▅▄▄▃▃▁▁▁
▁▁▂▂▂▂▂▂▂▂▂▃▄▆▇▇▇▇▇████████████████▆▆▅▅▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁ ▄
322 ns Histogram: frequency by time 448 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min … max): 1.118 μs … 4.114 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.131 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.149 μs ± 127.122 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▆ ▁
██▇▅▄▅▅▁▁▁▁▁▁▃▅▅▃▁▃▁▁▁▁▁▁▁▃▁▃▆█▁▁▁▁▁▁▃▁▁▁▃▁▃▁▁▁▃▃▁▁▁▁▁▁▃▄▇▇ █
1.12 μs Histogram: log(frequency) by time 2.03 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Next, we use a plain array of structures (AoS) in the form of a two-dimensional array together with our custom mul_aos! implementation, which loops over each component, applying mul! to views. Here, the differences between the timings are less pronounced.
println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min … max): 1.271 μs … 2.998 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.280 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.293 μs ± 101.293 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█
█▄▁▁▁▁▂▁▁▁▁▁▁▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
1.27 μs Histogram: frequency by time 2.13 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
Range (min … max): 2.504 μs … 5.823 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.624 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.658 μs ± 191.913 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▇▄
▂▂▄▇█████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▂▂▁▁▂▁▂▁▂▁▁▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
2.5 μs Histogram: frequency by time 3.64 μs <
Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
Range (min … max): 6.352 μs … 18.214 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.400 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.501 μs ± 534.005 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▁ ▁▁▁ ▁
███▆▇▅▄▁▁▄▆▆▇▅▃▃▃▁▃▃▃▁▁▁▁▃▁▁▃▁▃▃▁▁▁▃▁▁▄▁▁▅▇███▇▆▅▅▅▅▅▅▁▄▅▄▄ █
6.35 μs Histogram: log(frequency) by time 8.83 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Now, we use an array of structures (AoS) based on reinterpret and the standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. On Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 535 evaluations per sample.
Range (min … max): 210.129 ns … 324.006 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 214.456 ns ┊ GC (median): 0.00%
Time (mean ± σ): 216.493 ns ± 6.127 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅▇███▇▅▄▃▂▁ ▁▃▃▄▄▃▂▁ ▃
▃▄▄▆████████████████▇▇▃▄▄▃▅▅▁▄▄▃▃▅▃▁▁▁▃▁▃▃▁▁▆▇██████████▇▆▇▇▇ █
210 ns Histogram: log(frequency) by time 235 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 182 evaluations per sample.
Range (min … max): 568.747 ns … 11.722 μs ┊ GC (min … max): 0.00% … 92.85%
Time (median): 610.643 ns ┊ GC (median): 0.00%
Time (mean ± σ): 617.414 ns ± 114.237 ns ┊ GC (mean ± σ): 0.18% ± 0.93%
▂▄▅▆▇█▅▅▄▃▁
▂▁▂▂▁▂▂▂▂▃▃▃▄▄▆▇█████████████▆▅▄▄▃▃▂▂▂▂▂▂▃▂▃▃▄▄▄▄▃▄▃▃▃▃▃▂▂▂▂▂ ▄
569 ns Histogram: frequency by time 680 ns <
Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 13.976 μs … 42.840 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 14.056 μs ┊ GC (median): 0.00%
Time (mean ± σ): 14.221 μs ± 1.186 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅ ▁
██▁▁▁▄▆▆▁▃▁▁▃▄▅▅▇█▄▁▃▁▃▁▄▃▁▄▁▄▃▃▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▄▆▇ █
14 μs Histogram: log(frequency) by time 22.8 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. There is no significant difference to the previous version in this case.
println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
Range (min … max): 210.520 ns … 295.326 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 213.974 ns ┊ GC (median): 0.00%
Time (mean ± σ): 216.042 ns ± 5.990 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▄▆██▇▆▅▃▁ ▁ ▂▃▄▄▃▁▁ ▂
▅▅▆████████████████▇▆▆▃▁▃▄▁▃▄▁▁▁▁▃▄▃▃▃▃▁▃▁▁▅▆█████████▇▇▇▇▇▆▇ █
211 ns Histogram: log(frequency) by time 234 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 192 evaluations per sample.
Range (min … max): 507.141 ns … 836.870 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 524.052 ns ┊ GC (median): 0.00%
Time (mean ± σ): 529.391 ns ± 18.967 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▆██▇▇▇▅▄▂
▂▂▂▃▃▅█████████████▇▆▅▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▃▃▄▄▄▃▃▃▄▃▃▃▃▂▂▂▂▂ ▄
507 ns Histogram: frequency by time 582 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 13.825 μs … 198.609 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.895 μs ┊ GC (median): 0.00%
Time (mean ± σ): 15.616 μs ± 9.554 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁ ▃ ▁ ▁
█▇██▆▇▇▆█▇█▅▅▄▄▁▁▄▁▁▁▁▄▁▁▄▁▁▃▁▃▁▃▁▄▁▁▁▁▃▃▁▄▃▃▃▄▄▄▅▄▅▆▆▆▆▅▅▆▆ █
13.8 μs Histogram: log(frequency) by time 65.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.
println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
Range (min … max): 224.077 ns … 358.996 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 227.412 ns ┊ GC (median): 0.00%
Time (mean ± σ): 229.624 ns ± 7.413 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▆██▆▄▁
▂▂▄▇███████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▁▂▁▁▁▁▁▂▁▂▂▂▂▂▃▃▃▄▄▃▃▃▂▂▂▂▂▂▂ ▃
224 ns Histogram: frequency by time 250 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 48.911 μs … 99.095 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 49.071 μs ┊ GC (median): 0.00%
Time (mean ± σ): 49.595 μs ± 2.313 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
██▁ ▁▂▂ ▂
███▅▃██▁▁▁▁▇▇▅▅▃▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇█████▇▅ █
48.9 μs Histogram: log(frequency) by time 58.2 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
Range (min … max): 5.479 μs … 15.278 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.517 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.627 μs ± 530.798 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▃ ▂▁ ▁
███▇▃▇▆▆▅▄▁▁▃▁▁▁▁▃▁▄▃▁▄▃▄▇██▇▅▅▆▅▆▄▅▄▄▅▄▄▅▅▃▃▁▄▄▃▄▄▃▃▄▃▅▄▄▆ █
5.48 μs Histogram: log(frequency) by time 8.86 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
These results were obtained using the following versions.
using InteractiveUtils
versioninfo()
using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × AMD EPYC 7763 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
[90137ffa] StaticArrays v1.9.18
[09ab397b] StructArrays v0.7.3
[9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`