Monday, December 1, 2025

Using db_bench to measure RocksDB performance with gcc and clang

This has results for db_bench, a benchmark for RocksDB, when compiling it with gcc and clang. On one of my servers I saw a regression on one of the tests (fillseq) when compiling with gcc. The result on that server didn't match what I measured on two other servers. So I repeated tests after compiling with clang to see if I could reproduce it.

tl;dr

  • a common outcome is
    • ~10% more QPS with clang+LTO than with gcc
    • ~5% more QPS with clang than with gcc
  • the performance gap between clang and gcc is larger in RocksDB 10.x than in earlier versions

Variance

I always worry about variance when I search for performance bugs. Variance can be misinterpreted as a performance regression and I strive to avoid that because I don't want to file bogus performance bugs.

Possible sources of variance are:

  • the compiler toolchain
    • a bad code layout might hurt performance by increasing cache and TLB misses
  • RocksDB
    • the overhead from compaction is intermittent and the LSM tree layout can help or hurt CPU overhead during reads
  • hardware
    • sources include noisy neighbors on public cloud servers, insufficient CPU cooling and CPU frequency management that is too clever
  • benchmark client
    • the way in which I run tests can create more or less variance and more information on that is here and here

Software

I used RocksDB versions 6.29.5, 7.10.2, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version three times:

  • gcc - using version 13.3.0
  • clang - using version 18.1.3
  • clang+LTO - using version 18.1.3, where LTO is link-time optimization

The build command lines are below:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for gcc
make "${flags[@]}" static_lib db_bench

# for clang
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench
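
To check which toolchain produced a binary, I can read the .comment section of the ELF file where compilers record their version strings. This is a quick sanity check, not part of the builds above:

# compilers record their name and version in the .comment section
readelf -p .comment db_bench
# expect something like "GCC: (Ubuntu 13.3.0 ...)" for gcc
# or "Ubuntu clang version 18.1.3 ..." for clang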

On the small servers I used the LRU block cache. On the large server I used hyper clock when possible:
  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.6+
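
As an illustration, the block cache implementation can be selected on the db_bench command line. The --cache_type values match the names above, but flag support varies by RocksDB version, so treat this as a sketch rather than my exact scripts:

# sketch: select the block cache implementation (flag values vary by version)
./db_bench --benchmarks=readrandom --cache_type=auto_hyper_clock_cache \
    --cache_size=$(( 16 * 1024 * 1024 * 1024 ))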

Hardware

I used two small servers and one large server, all running Ubuntu 22.04:

  • pn53
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post
    • benchmarks are run with 1 client (thread)
  • arm
    • an ARM server from the Google cloud -- c4a-standard-8-lssd with 8 cores and 32G of RAM, 2 local SSDs using RAID 0 and ext-4
    • benchmarks are run with 1 client (thread)
  • hetzner
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM, 2 SSDs with RAID 1 (3.8T each) using ext4
    • benchmarks are run with 36 clients (threads)

Benchmark

Overviews on how I use db_bench are here and here.

Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys
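
As a rough sketch of how a step like readww maps to db_bench: I assume it uses the readwhilewriting benchmark with the writer limited via --benchmark_write_rate_limit. My helper scripts set many more options, so the flags below are illustrative, not exact:

# sketch: point queries plus one rate-limited writer
./db_bench --benchmarks=readwhilewriting --threads=36 --use_existing_db=1 \
    --benchmark_write_rate_limit=$(( 2 * 1024 * 1024 )) --duration=300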

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS, which is:
    (QPS for my version) / (QPS for the base version)

The base version is usually RocksDB 6.29 compiled with gcc, but it varies and is listed below. When the relative QPS is > 1.0 then my version is faster than the base version. When it is < 1.0 then there might be a performance regression or there might just be noise.
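
A worked example with made-up numbers: if the base version gets 10,000 QPS and my version gets 10,800 QPS then the relative QPS is 1.08, so my version is ~8% faster.

# relative QPS = (QPS for my version) / (QPS for the base version)
awk 'BEGIN { printf "%.2f\n", 10800 / 10000 }'    # prints 1.08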

The spreadsheet with numbers and charts is here.

Results: fillseq

Results for the pn53 server

  • clang+LTO provides ~15% more QPS than gcc in RocksDB 10.8
  • clang provides ~11% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • I am fascinated by how stable the QPS is here for clang and clang+LTO
  • clang+LTO and clang provide ~3% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • the performance for RocksDB 10.8.3 with gcc is what motivated me to repeat tests with clang
  • clang+LTO and clang provide ~20% more QPS than gcc in RocksDB 10.8

Results: revrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~11% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~8% more QPS than gcc in RocksDB 10.8
  • clang provides ~3% more QPS than gcc in RocksDB 10.8

Results: fwdrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~4% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~13% more QPS than gcc in RocksDB 10.8
  • clang provides ~7% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: readww

Results for the pn53 server

  • clang+LTO provides ~6% more QPS than gcc in RocksDB 10.8
  • clang provides ~5% less QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~14% more QPS than gcc in RocksDB 10.8
  • clang provides ~2% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: overwrite

Results for the pn53 server

  • clang+LTO provides ~6% less QPS than gcc in RocksDB 10.8
  • clang provides ~8% less QPS than gcc in RocksDB 10.8
  • but for most versions there is similar QPS for gcc, clang and clang+LTO

Results for the Arm server

  • QPS is similar for gcc, clang and clang+LTO

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~2% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8
Saturday, November 29, 2025

Using sysbench to measure how Postgres performance changes over time, November 2025 edition

This has results for the sysbench benchmark on a small and big server for Postgres versions 12 through 18. Once again, Postgres is boring because I search for perf regressions and can't find any here. Results from MySQL are here and MySQL is not boring.

While I don't show the results here, I don't see regressions when comparing the latest point releases with their predecessors -- 13.22 vs 13.23, 14.19 vs 14.20, 15.14 vs 15.15, 16.10 vs 16.11, 17.6 vs 17.7 and 18.0 vs 18.1.

tl;dr

  • a few small regressions
  • many more small improvements
  • for write-heavy tests at high concurrency there are many large improvements starting in PG 17

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.22, 13.23, 14.19, 14.20, 15.14, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used two servers:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • big
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)

Configuration files for the small server:
  • Configuration files are here for Postgres versions 12, 13, 14, 15, 16 and 17.
  • For Postgres 18 I used io_method=sync and the configuration file is here.

Configuration files for the big server:
  • Configuration files are here for Postgres versions 12, 13, 14, 15, 16 and 17.
  • For Postgres 18 I used io_method=sync and the configuration file is here.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy ones for 900 seconds. On the small server the benchmark is run with 1 client and 1 table with 50M rows. On the big server the benchmark is run with 12 clients and 8 tables with 10M rows per table.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.

I provide charts below with relative QPS. The relative QPS is the following:
    (QPS for some version) / (QPS for Postgres 12.22)
When the relative QPS is > 1 then some version is faster than Postgres 12.22. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than Postgres 12.22.

Values from iostat and vmstat divided by QPS are here for the small server and the big server. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.
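
A sketch of that normalization with made-up numbers -- the per-operation values are just the vmstat averages divided by QPS:

# hypothetical: 120,000 context switches/s and 8 busy cores at 2,500 QPS
awk 'BEGIN {
  qps = 2500
  printf "cs/o  = %.0f\n", 120000 / qps        # context switches per operation
  printf "cpu/o = %.0f\n", 8 * 1e6 / qps       # CPU usecs per operation
}'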

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

This is from the small server.
  • a large improvement arrived in Postgres 17 for the hot-points test
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • a large improvement arrived in Postgres 17 for the hot-points test
  • otherwise results have been stable from 12.22 through 18.1

Results: range queries without aggregation

This is from the small server.
  • there are small improvements for the scan test
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there are small improvements for the scan test
  • otherwise results have been stable from 12.22 through 18.1

Results: range queries with aggregation

This is from the small server.
  • there are small improvements for a few tests
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there might be small regressions for a few tests
  • otherwise results have been stable from 12.22 through 18.1

Results: writes

This is from the small server.
  • there are small improvements for most tests
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there are large improvements for half of the tests
  • otherwise results have been stable from 12.22 through 18.1

From vmstat results for update-index the per-operation CPU overhead and context switch rate are much smaller starting in Postgres 17.7. The CPU overhead is about 70% of what it was in 16.11 and the context switch rate is about 50% of the rate for 16.11. Note that context switch rates are often a proxy for mutex contention.

Friday, November 28, 2025

Using sysbench to measure how MySQL performance changes over time, November 2025 edition

This has results for the sysbench benchmark on a small and big server for MySQL versions 5.6 through 9.5. The good news is that the arrival rate of performance regressions has mostly stopped as of 8.0.43. The bad news is that there were large regressions from 5.6 through 8.0.

tl;dr for low-concurrency tests

  • for point queries
    • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries without aggregation
    • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries with aggregation
    • MySQL 5.7.44 is faster than 5.6.51 for two tests, as fast for one and gets about 15% less QPS for the other five
    • MySQL 8.0 to 9.5 are faster than 5.6.51 for one test, as fast for one and get about 30% less QPS for the other six
  • for writes
    • MySQL 5.7.44 gets between 10% and 20% less QPS than 5.6.51 for most tests
    • MySQL 8.0 to 9.5 get between 40% and 50% less QPS than 5.6.51 for most tests

tl;dr for high-concurrency tests

  • for point queries
    • for most tests MySQL 5.7 to 9.5 get at least 1.5X more QPS than 5.6.51
    • for tests that use secondary indexes MySQL 5.7 to 9.5 get about 25% less QPS than 5.6.51
  • for range queries without aggregation
    • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries with aggregation
    • MySQL 5.7.44 is faster than 5.6.51 for six tests, as fast for one test and gets about 20% less QPS for one test
    • MySQL 8.0 to 9.5 are a lot faster than 5.6.51 for two tests, about as fast for three tests and get between 10% and 30% less QPS for the other three tests
  • for writes
    • MySQL 5.7.44 gets more QPS than 5.6.51 for all tests
    • MySQL 8.0 to 9.5 get more QPS than 5.6.51 for all tests

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

I used two servers:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • big
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)

The config files are here.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy ones for 900 seconds. On the small server the benchmark is run with 1 client and 1 table with 50M rows. On the big server the benchmark is run with 40 clients and 8 tables with 10M rows per table.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.

I provide charts below with relative QPS. The relative QPS is the following:
    (QPS for some version) / (QPS for MySQL 5.6.51)
When the relative QPS is > 1 then some version is faster than MySQL 5.6.51. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 5.6.51.

Values from iostat and vmstat divided by QPS are here for the small server and the big server. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

This is from the small server.
  • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the hot-points test.
This is from the large server.
  • For most point query tests MySQL 5.7 to 9.5 get at least 1.5X more QPS than 5.6.51
    • MySQL 5.7 to 9.5 use less CPU, see vmstat results for the hot-points test.
  • For tests that use secondary indexes (*-si) MySQL 5.7 to 9.5 get about 25% less QPS than 5.6.51.
    • This result is similar to what happens on the small server above.
    • The regressions are from extra CPU overhead, see vmstat results
  • MySQL 5.7 does better than 8.0 to 9.5. There are few regressions after MySQL 8.0.

Results: range queries without aggregation

This is from the small server.
  • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the scan test.
This is from the large server.
  • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the scan test.

Results: range queries with aggregation

This is from the small server.
  • for the read-only-distinct test, MySQL 5.7 to 9.5 are faster than 5.6.51
  • for the read-only_range=X tests
    • with the longest range scan (*_range=10000), MySQL 5.7.44 is faster than 5.6.51 and 8.0 to 9.5 have the same QPS as 5.6.51
    • with shorter range scans (*_range=100 & *_range=10) MySQL 5.6.51 is faster than 5.7 to 9.5. This implies that the regressions are from code above the storage engine layer.
    • From vmstat results the perf differences are explained by CPU overheads
  • for the other tests
    • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51
    • From vmstat results for read-only-count the reason is new CPU overhead
This is from the large server.
  • for the read-only-distinct test, MySQL 5.7 to 9.5 are faster than 5.6.51
  • for the read-only_range=X tests
    • MySQL 5.7.44 is as fast as 5.6.51 for the longest range scan and faster than 5.6.51 for the shorter range scans
    • MySQL 8.0 to 9.5 are much faster than 5.6.51 for the longest range scan and somewhat faster for the shorter range scans
    • From vmstat results the perf differences are explained by CPU overheads and possibly by changes in mutex contention
  • for the other tests
    • MySQL 5.7.44 gets about 20% less QPS than 5.6.51 for read-only-count and about 10% more QPS than 5.6.51 for read-only-simple and read-only-sum
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51 for read-only-count and up to 20% less QPS than 5.6.51 for read-only-simple and read-only-sum
    • From vmstat results for read-only-count the reason is new CPU overhead

Results: writes

This is from the small server.
  • For most tests
    • MySQL 5.7.44 gets between 10% and 20% less QPS than 5.6.51
    • MySQL 8.0 to 9.5 get between 40% and 50% less QPS than 5.6.51
    • From vmstat results for the insert test, MySQL 5.7 to 9.5 use a lot more CPU
  • For the update-index test
    • MySQL 5.7.44 is faster than 5.6.51
    • MySQL 8.0 to 9.5 get about 10% less QPS than 5.6.51
    • From vmstat metrics MySQL 5.6.51 has more mutex contention
  • For the update-inlist test
    • MySQL 5.7.44 is as fast as 5.6.51
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51
    • From vmstat metrics MySQL 5.6.51 has more mutex contention
This is from the large server and the y-axis truncates the result for the update-index test to improve readability for the other results.
  • For all tests MySQL 5.7 to 9.5 get more QPS than 5.6.51
    • From vmstat results for the write-only test MySQL 5.6.51 uses more CPU and has more mutex contention.
  • For some tests (read-write_range=X) MySQL 8.0 to 9.5 get less QPS than 5.7.44
    • These are the classic sysbench transactions with different range scan lengths and the performance is dominated by the range query response time, thus 5.7 is fastest.
  • For most tests MySQL 5.7 to 9.5 have similar perf with two exceptions
    • For the delete test, MySQL 8.0 to 9.5 are faster than 5.7. From vmstat metrics 5.7 uses more CPU and has more mutex contention than 8.0 to 9.5.
    • For the update-inlist test, MySQL 8.0 to 9.5 are faster than 5.7. From vmstat metrics 5.7 uses more CPU than 8.0 to 9.5.
This is also from the large server and does not truncate the update-index test result.

Saturday, November 22, 2025

Challenges compiling old C++ code on modern Linux

I often compile old versions of MySQL, MariaDB, Postgres and RocksDB in my search for performance regressions. Compiling is easy with Postgres as they do a great job at avoiding compilation warnings and I never encounter broken builds. Certainly the community gets the credit for this, but I suspect their task is easier because they use C. This started as a LinkedIn post.

I expect people to disagree, and I am far from a C++ expert, but here goes ...

tl;dr - if you maintain widely used header files (widely used by C++ projects) consider not removing that include you don't really need (like <cstdint>) because such removal is likely to break builds for older releases of projects that use your header.

I have more trouble compiling older releases of C++ projects. For MySQL I have a directory on GitHub that includes patches that must be applied. And for MySQL I have to patch all 5.6 versions, 5.7 versions up to 5.7.33 and 8.0 versions up to 8.0.23. The most common reason for the patch is missing C++ includes (like <cstdint>).

For RocksDB with gcc I don't have to patch files but I need to use gcc-11 for RocksDB 6.x and gcc-12 for RocksDB 7.x.

For RocksDB with clang I don't have to patch files for RocksDB 8.x, 9.x and 10.x while I do have to patch 6.x and 7.x. For RocksDB 7.10 I need to edit two files to add <cstdint>. The files are:

  • table/block_based/data_block_hash_index.h
  • util/string_util.h
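
A sketch of the fix I apply; the exact insertion point doesn't matter much as long as the include appears before the fixed-width integer types are used:

# hypothetical one-liner to prepend the missing include to both files
sed -i '1i #include <cstdint>' \
    table/block_based/data_block_hash_index.h util/string_util.h
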
All of this is true for Ubuntu 24.04 with clang 18.1.3 and gcc 13.3.0.

One more detail, for my future self: the command line I use to compile RocksDB with clang is one of the following.
  • Rather than remember which of V= and VERBOSE= I need, I just use both
  • I get errors if I don't define AR and RANLIB when using clang
  • While clang-18 installs clang and clang++ binaries, to get the llvm variants of ar and ranlib I need to use llvm-ar-18 and llvm-ranlib-18 rather than llvm-ar and llvm-ranlib

# without link-time optimization
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 \
CC=clang CXX=clang++ \
make \
DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j... \
static_lib db_bench

# with link-time optimization
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 \
CC=clang CXX=clang++ \
make USE_LTO=1 \
DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j... \
static_lib db_bench

Thursday, October 23, 2025

How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB as this is extra overhead beyond what libc and the kernel require to perform an IO. Here my focus is on KV pairs that are smaller than the typical RocksDB block size that I use -- 8kb.

By IO efficiency I mean:
    (storage read IOPs from RocksDB benchmark / storage read IOPs from fio)

And I measure this in a setup where RocksDB doesn't get much benefit from block cache hits (database size > 400G, block cache size was 16G).

This value will be less than 1.0 in such a setup. But how much less than 1.0 will it be? On my hardware the IO efficiency was ~0.85 at 1 client and ~0.88 at 6 clients. Were I to use slower storage, such as an SSD where read latency was ~200 usecs at io_depth=1, then the IO efficiency would be closer to 0.95.

Note that:

  • IO efficiency increases (decreases) when SSD read latency increases (decreases)
  • IO efficiency increases (decreases) when the RocksDB CPU overhead decreases (increases)
  • RocksDB QPS increases by ~8% for IO-bound workloads when --block_align is enabled

The overheads per 8kb block read on my test hardware were:

  • about 11 microseconds from libc + kernel
  • between 6 and 10 microseconds from RocksDB
  • ~100 usecs of IO latency at io_depth=1, ~150 usecs at io_depth=6

A simple performance model

A simple model to predict the wall-clock latency for reading a block is:
    userland CPU + libc/kernel CPU + device latency

For fio I assume that userland CPU is zero; I measured libc/kernel at ~11 usecs and estimate that device latency is ~91 usecs. My device latency estimate comes from read-only benchmarks with fio where fio reports the average latency as 102 usecs, which includes 11 usecs of CPU from libc+kernel, and 91 = 102 - 11.

This model isn't perfect, as I will show below when reporting results for RocksDB, but it might be sufficient, and it lets you predict latencies and IO efficiency when the RocksDB CPU overhead is increased or reduced.

Q and A

The RocksDB API could function as a universal API for storage engines, and if new DBMS built on that then it would be possible to combine new DBMS with new storage engines much faster than what is possible today.

Persistent hash indexes are not widely implemented, but getting one that uses the RocksDB API would be interesting for workloads such as the one I run here. However, there are fewer use cases for a hash index (no range queries) than for a range index like an LSM so it is harder to justify the investment in such work.

Q: What is the CPU overhead from libc + kernel per 8kb read?
A: About 10 microseconds on this CPU.

Q: Can you write your own code that will be faster than RocksDB for such a workload?
A: Yes, you can.

Q: Should you write your own library for this?
A: It depends on how many features you need and the opportunity cost in spending time writing that code vs doing something else.

Q: Will RocksDB add features to make this faster?
A: That is for them to answer. But all projects have a complexity budget. Code can become too expensive to maintain when that budget is exceeded. There is also the opportunity cost to consider as working on this delays work on other features.

Q: Does this matter?
A: It matters more when storage is fast (read latency less than 100 usecs). As read response time grows the CPU overhead from RocksDB becomes much less of an issue.

Benchmark hardware

I ran tests on a Beelink SER7 with a Ryzen 7 7840HS CPU that has 8 cores and 32G of RAM. The storage device is a Crucial CT1000P3PSSD8 (Crucial P3, 1TB) using ext-4 with discard enabled. The OS is Ubuntu 24.04.

From fio, the average read latency for the SSD is 102 microseconds using O_DIRECT with io_depth=1 and the sync engine.

CPU frequency management makes it harder to claim that the CPU runs at X GHz, but the details are:

$ cpupower frequency-info

analyzing CPU 5:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 5
  CPUs which need to have their frequency coordinated by software: 5
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.60 GHz - 3.80 GHz
  available frequency steps:  3.80 GHz, 2.20 GHz, 1.60 GHz
  available cpufreq governors: conservative ... powersave performance schedutil
  current policy: frequency should be within 1.60 GHz and 3.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.79 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: no

Results from fio

I started with fio using a command-line like the following for NJ=1 and NJ=6 to measure average IOPs and the CPU overhead per IO.

fio --name=randread --rw=randread --ioengine=sync --numjobs=$NJ --iodepth=1 \
  --buffered=0 --direct=1 \
  --bs=8k \
  --size=400G \
  --randrepeat=0 \
  --runtime=600s --ramp_time=1s \
  --filename=G_1:G_2:G_3:G_4:G_5:G_6:G_7:G_8  \
  --group_reporting

Results are:

legend:
* iops - average reads/s reported by fio
* usPer, syPer - user, system CPU usecs per read
* cpuPer - usPer + syPer
* lat.us - average read latency in microseconds
* numjobs - the value for --numjobs with fio

iops    usPer   syPer   cpuPer  lat.us  numjobs
 9884   1.351    9.565  10.916  101.61  1
43782   1.379   10.642  12.022  136.35  6
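
As a sanity check, IOPs should be close to (numjobs * 1,000,000 / lat.us) for the sync engine at iodepth=1, and it is:

awk 'BEGIN {
  printf "%.0f\n", 1 * 1e6 / 101.61    # ~9842 predicted vs 9884 measured
  printf "%.0f\n", 6 * 1e6 / 136.35    # ~44004 predicted vs 43782 measured
}'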

Results from RocksDB

I used an edited version of my benchmark helper scripts that run db_bench. In this case the sequence of tests was:

1. fillseq - loads the LSM tree in key order
2. revrange - I ignore the results from this
3. overwritesome - overwrites 10% of the KV pairs
4. flush_mt_l0 - flushes the memtable, waits, compacts L0 to L1, waits
5. readrandom - does random point queries when the LSM tree has many levels
6. compact - compacts the LSM tree into one level
7. readrandom2 - does random point queries when the LSM tree has one level, bloom filters enabled
8. readrandom3 - does random point queries when the LSM tree has one level, bloom filters disabled

I use readrandom, readrandom2 and readrandom3 to vary the amount of work that RocksDB must do per query and measure the CPU overhead of that work. The most work happens with readrandom as the LSM tree has many levels and there are bloom filters to check. The least work happens with readrandom3 as the LSM tree only has one level and there are no bloom filters to check.

Initially I ran tests with --block_align not set as that reduces space-amplification (less padding) but 8kb reads are likely to cross file system page boundaries and become larger reads. But given that the focus here is on IO efficiency, I used --block_align.

A summary of the results for db_bench with 1 user (thread) and 6 users (threads) is:

--- 1 user
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

--- 6 users
38391   38628   8.1     14.645   7.291  21.936  156.27  134     readrandom
39359   38623   8.3     10.449   9.346  19.795  152.43  144     readrandom2
39669   38874   8.0      9.459   9.850  19.309  151.24  140     readrandom3

From these results:
  • IO efficiency is approximately 0.84 at 1 client and 0.88 at 6 clients
  • With 1 user RocksDB adds between 6.534 and 8.330 usecs of CPU time per query compared to fio, depending on the amount of work it has to do.
  • With 6 users RocksDB adds between 7.287 and 9.914 usecs of CPU time per query
  • IO latency as reported by RocksDB is ~20 usecs larger than as reported by iostat. But I have to re-read the RocksDB source code to understand where and how it is measured.

legend:
* io.eff - IO efficiency as (db_bench storage read IOPs / fio storage read IOPs)
* us.inc - incremental user CPU usecs per read as (db_bench usPer - fio usPer)
* cpu.inc - incremental total CPU usecs per read as (db_bench cpuPer - fio cpuPer)

--- 1 user

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.844           10.292           8.330          readrandom
        0.842            8.646           7.607          readrandom2
        0.849            7.381           6.534          readrandom3

--- 6 users

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.882           13.266           9.914          readrandom
        0.882            9.070           7.773          readrandom2
        0.887            8.080           7.287          readrandom3
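
For example, the readrandom row at 1 user is derived from the two tables above like this:

awk 'BEGIN {
  printf "io.eff  = %.4f\n", 8350 / 9884        # db_bench iops / fio iops
  printf "us.inc  = %.3f\n", 11.643 - 1.351     # db_bench usPer - fio usPer
  printf "cpu.inc = %.3f\n", 19.246 - 10.916    # db_bench cpuPer - fio cpuPer
}'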

Evaluating the simple performance model

I described a simple performance model earlier in this blog post and now it is time to see how well it does for RocksDB. First I will use values from the 1 user/client/thread case:
  • IO latency is ~91 usecs per fio
  • libc+kernel CPU overhead is ~11 usecs per fio
  • RocksDB CPU overhead is 8.330, 7.607 and 6.534 usecs for readrandom, *2 and *3
The model is far from perfect as it predicts that RocksDB will sustain:
  • 9063 IOPs for readrandom, when it actually did 8350
  • 9124 IOPs for readrandom2, when it actually did 8327
  • 9214 IOPs for readrandom3, when it actually did 8400
Regardless, the model is a good way to think about the problem.
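
The predicted IOPs above come straight from the model. For 1 user, predicted IOPs = 1,000,000 / (RocksDB CPU + libc/kernel CPU + device latency):

awk 'BEGIN {
  # usecs per read: RocksDB CPU + libc/kernel CPU (~11) + device latency (~91)
  printf "%.0f\n", 1e6 / (8.330 + 11 + 91)    # ~9063 for readrandom
  printf "%.0f\n", 1e6 / (7.607 + 11 + 91)    # ~9124 for readrandom2
  printf "%.0f\n", 1e6 / (6.534 + 11 + 91)    # ~9214 for readrandom3
}'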

The impact from --block_align

RocksDB QPS increases by between 7% and 9% when --block_align is enabled. Enabling it reduces read-amp and increases space-amp. But given that the focus here is on IO efficiency I prefer to enable it. RocksDB QPS increases with it enabled because fewer storage read requests cross file system page boundaries, thus the average read size from storage is reduced (see the reqsz column below).

legend:
* qps - RocksDB QPS
* iops - average reads/s reported by iostat
* reqsz - average read request size in KB per iostat
* usPer, syPer, cpuPer - user, system and (user+system) CPU usecs per read
* rx.lat - average read latency in microseconds, per RocksDB
* io.lat - average read latency in microseconds, per iostat
* test - the db_bench test name

- block_align disabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
7629     7740   8.9     12.133   8.718  20.852  137.92  111     readrandom
7866     7813   9.1     10.094   9.098  19.192  127.12  115     readrandom2
7972     7862   8.6      8.931   9.326  18.257  125.44  110     readrandom3

- block_align enabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

Async IO in RocksDB

Per the wiki, RocksDB can do async IO for point queries that use MultiGet. That is done via coroutines and requires linking with Folly. My builds do not support that today and because my focus is on efficiency rather than throughput I did not try it for this test.

Flamegraphs

Flamegraphs are here for readrandom, readrandom2 and readrandom3.

A summary of where CPU time is spent based on the flamegraphs:

Legend:
* rr, rr2, rr3 - readrandom, readrandom2, readrandom3
* libc+k - time in libc + kernel
* checksm - verify data block checksum after read
* IBI:Sk - IndexBlockIter::SeekImpl
* DBI:Sk - DataBlockIter::SeekImpl
* LRU - lookup, insert blocks in the LRU, update metrics
* bloom - check bloom filters
* BSI - BinarySearchIndexReader::NewIterator
* File - FilePicker::GetNextFile, FindFileInRange
* other - other parts of the call stack, from DBImpl::Get and functions called by it

Percentage of samples
        rr      rr2     rr3
libc+k  37.30   42.22   50.92
checksm  3.76    2.66    2.91
IBI:Sk   7.07    7.36    7.76
DBI:Sk   3.05    2.15    1.96
LRU      5.19    6.19    6.02
bloom   18.35    8.14    0
BSI      2.28    4.02    3.12
File     3.74    3.34    4.44
other   19.26   23.92   22.87

Monday, October 20, 2025

Determine how much concurrency to use on a benchmark for small, medium and large servers

What I describe here works for me given my goal, which is to find performance regressions. A benchmark run at low concurrency is used to find regressions from CPU overhead. A benchmark run at high concurrency is used to find regressions from mutex contention. A benchmark run at medium concurrency might help find both.

My informal way of classifying servers by size is:

  • small - has less than 10 cores
  • medium - has between 10 and 20 cores
  • large - has more than 20 cores

How much concurrency?

I almost always co-locate benchmark clients and the DBMS on the same server. This comes at a cost (less CPU and RAM is available for the DBMS) and might have odd artifacts because clients in the real world are usually not co-located. But it has benefits that matter to me. First, I don't worry about variance from changes in network latency. Second, this is much easier to set up.

I try to not oversubscribe the CPU when I run a benchmark. For benchmarks where there are few waits for reads from or writes to storage, I will limit the number of benchmark users so that the concurrent connection count is less than the number of CPU cores (cores, not vCPUs), and I almost always use servers with Intel Hyperthreads and AMD SMT disabled. I do this because DBMS performance suffers when the CPU is oversubscribed, and back when I was closer to production we did our best to avoid that state.

Even for benchmarks that have some benchmark steps where the workload will have IO waits, I will still limit the amount of concurrency unless all benchmark steps that I measure will have IO waits.

Assuming a benchmark is composed of a sequence of steps (at minimum: load, query), I consider the number of concurrent connections per benchmark user. For sysbench, the number of concurrent connections is the same as the number of users, although sysbench uses the --threads argument to set the number of users. I am just getting started with TPROC-C via HammerDB and that appears to be like sysbench with one concurrent connection per virtual user (VU).

For the Insert Benchmark the number of concurrent connections is 2X the number of users on the l.i1 and l.i2 steps and then 3X the number of users on the range-query read-write steps (qr*) and the point-query read-write steps (qp*). And whether or not there are IO-waits for these users is complicated, so I tend to configure the benchmark so that the number of users is no more than half the number of CPU cores, as the sketch below illustrates.

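Putting numbers on that, a sketch for a hypothetical 48-core server using the multipliers above:

awk 'BEGIN {
  cores = 48
  users = cores / 2                                  # at most half the cores
  printf "l.i1/l.i2 connections: %d\n", 2 * users    # 48 connections
  printf "qr*/qp* connections:   %d\n", 3 * users    # 72 connections
}'
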
Finally, I usually set the benchmark concurrency level to be less than the number of CPU cores because I want to leave some cores for the DBMS to do the important background work, which is mostly MVCC garbage collection -- MyRocks compaction, InnoDB purge and dirty page writeback, Postgres vacuum.

Thursday, October 16, 2025

Why is RocksDB spending so much time handling page faults?

This week I was running benchmarks to understand how fast RocksDB could do IO, and then compared that to fio to understand the CPU overhead added by RocksDB. While looking at flamegraphs taken during the benchmark I was confused to see that about 20% of the samples were from page fault handling.

The lesson here is to run your benchmark long enough to reach a steady state before you measure things or there will be confusion. And I was definitely confused when I first saw this. Perhaps my post saves time for the next person who spots this.

The workload is db_bench with a database size that is much larger than memory and read-only microbenchmarks for point lookups and range scans.

Then I wondered if this was a transient issue that occurs while RocksDB is warming up the block cache and growing process RSS until the block cache has been fully allocated.

While b-trees as used by Postgres and MySQL will do a large allocation at process start, RocksDB does an allocation per block read, and when the block is evicted then the allocation is free'd. This can be a stress test for a memory allocator which is why jemalloc and tcmalloc work better than glibc malloc for RocksDB. I revisit the mallocator topic every few years and my most recent post is here.

In this case I use RocksDB with jemalloc. Even though per-block allocations are transient, the memory used by jemalloc is mostly not transient. While there are cases where jemalloc can return memory to the OS, with my usage that is unlikely to happen.

Were I to let the benchmark run long enough, eventually jemalloc would stop asking the OS for more memory. However, my tests were running for about 10 minutes and doing about 10,000 block reads per second while I had configured RocksDB to use a block cache that was at least 36G and the block size was 8kb. So my tests weren't running long enough for the block cache to fill, which means that during the measurement period:

  • jemalloc was still asking for memory
  • block cache eviction wasn't needed and after each block read a new entry was added to the block cache

The result in this example is that 22.69% of the samples are from page fault handling. That is the second large stack from the left. The RocksDB code where it happens is rocksdb::BlockFetcher::ReadBlockContents.

When I run the benchmark for longer, the CPU overhead from page fault handling goes away.