Wednesday, January 14, 2026

Debugging regressions with Postgres in IO-bound sysbench

I explained in this post that there is a possible performance regression for Postgres with IO-bound sysbench. It arrived in Postgres 16 and remains in Postgres 18. I normally run sysbench with a cached database, but I had a spare server so I repeated tests with an IO-bound workload.

The bad news for me is that I need to spend more time explaining the problem. The good news for me is that I learn more about Postgres by doing this. And to be clear, I have yet to explain the regression but this post documents my debugging efforts.

sysbench

This post explains how I use sysbench. Note that modern sysbench is a framework for running benchmarks and it comes with many built-in tests. By framework I mean that it includes much plumbing for several DBMS, and tests can be written in Lua with support for the Lua JIT so the client side of the tests uses less CPU.

The sysbench framework has been widely used in the MySQL community for a long time. Originally it hard-coded one (or perhaps a few) tests, and in too many cases by sysbench people mean the classic (original) sysbench transaction, which is a mix of point queries, range queries and writes. The classic transaction is now implemented by oltp_read_write.lua with help from oltp_common.lua.

The oltp_read_write.lua test is usually not good at detecting regressions but in this case (the problem motivating me to write this blog post) it has detected the regression.

How to find regressions

I have opinions about how to run DBMS benchmarks. For example, be careful about running read-only benchmarks for an LSM because the state (shape) of the LSM tree can have a large impact on CPU overhead, and with a read-only workload the tree might remain in the same (good or bad) state for the duration of the test. With read-write tests the LSM state should cycle between better and worse states.

But traditional b-tree storage engines also have states. One aspect of that is MVCC debt -- write-heavy tests increase the debt and the engine doesn't always do a great job of limiting that debt. So I run sysbench tests in a certain order to both create and manage MVCC debt while trying to minimize noise.

The order of the tests is listed here and the general sequence is below (a rough Bash sketch follows the list):

  1. Load and index the tables
  2. Run a few read-only tests to let MVCC debt get reduced if it exists
  3. Optionally run more read-only tests (I usually skip these)
  4. Run write-heavy tests
  5. Do things to reduce MVCC debt (see here for Postgres and MySQL)
  6. Run read-only tests (after the tables+indexes have been subject to random writes)
  7. Run delete-only and then insert-only tests
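Below is a rough Bash sketch of that sequence, not my actual scripts. The test names, table count and sizes are illustrative (several of my tests, like update-zipf, come from custom Lua files) and the psql call is one way to reduce MVCC debt for Postgres.

# Rough sketch of the test sequence, not my scripts.
SB="sysbench --db-driver=pgsql --tables=8 --table-size=10000000 --threads=1"

$SB oltp_read_write prepare              # 1. load and index the tables
$SB --time=600 oltp_point_select run     # 2. read-only, lets MVCC debt get reduced
$SB --time=900 oltp_update_index run     # 4. a write-heavy test
psql -c 'VACUUM'                         # 5. reduce MVCC debt
$SB --time=600 oltp_read_only run        # 6. read-only after random writes
$SB --time=900 oltp_delete run           # 7. delete-only ...
$SB --time=900 oltp_insert run           #    ... then insert-only
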
Results

The possible regressions are:
  • update-zipf in Postgres 16, 17 and 18
  • write-only in Postgres 16, 17 and 18
  • read-write in Postgres 16, 17 and 18
  • insert in Postgres 18 - this reproduces in 18.0 and 18.1

Legend:
* relative QPS for Postgres 16.11, 17.7 and 18.1
* relative QPS is (QPS for my version / QPS for Postgres 15.15)

16.11   17.7    18.1
1.01    1.00    1.01    update-inlist
1.06    1.03    1.04    update-index
1.04    1.04    1.04    update-nonindex
1.00    1.08    1.07    update-one
0.85    0.72    0.71    update-zipf
0.88    0.86    0.84    write-only_range=10000
0.71    0.82    0.81    read-write_range=100
0.74    0.78    0.82    read-write_range=10
1.06    1.03    1.00    delete
1.02    1.00    0.80    insert

Note that when the insert test has a relative QPS of 0.80 for Postgres 18.1, then 18.1 gets 80% of the throughput of Postgres 15.15. So this is a problem to figure out.

Explaining the insert test for Postgres 18.1

I wrote useful but messy Bash scripts to make it easier to run and explain sysbench results. One of the things I do is collect results from vmstat and iostat per test and then summarize average and normalized values from them, where a normalized value is: (average from iostat or vmstat / QPS).

And then I compute relative values for them, where the base case here is Postgres 15.15: (value from my version / value from the base case)
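As a concrete example of the arithmetic, here is a small awk sketch (not my scripts). The r/s values are made up to roughly match the insert-test tables below for Postgres 18.1 vs the 15.15 base case.

# normalized value = average rate from vmstat/iostat divided by QPS
# relative value   = normalized value for a version divided by the base case
awk -v rs=13025 -v qps=27304 -v base_rs=13905 -v base_qps=34164 'BEGIN {
  ro      = rs / qps            # r/o for the version under test (18.1)
  base_ro = base_rs / base_qps  # r/o for the base case (15.15)
  printf "r/o = %.3f, relative r/o = %.2f\n", ro, ro / base_ro
}'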

From the results below I see that from Postgres 17.7 to 18.1
  • throughput decreases by ~20% in 18.1
  • cpu/o increased by ~17% in 18.1
  • cs/o increased by ~2.6% in 18.1
  • r/o increased by ~14% in 18.1
  • rKB/o increased by ~22% in 18.1
  • wKB/o increased by ~27% in 18.1
So Postgres 18.1 does a lot more IO during the insert test. Possible reasons are that either insert processing changed to be less efficient, or there is more MVCC debt at the start of the insert test from the write-heavy tests that precede it, and thus there is more IO from write-back and vacuum during the insert test, which reduces throughput. At this point I don't know. A next step for me is to repeat the benchmark with the delete test removed, as that immediately precedes the insert test (see here).

Legend:
* cpu/o - CPU per operation. This includes us and sy from vmstat but
          it isn't CPU microseconds.
* cs/o - context switches per operation
* r/o - reads from storage per operation
* rKB/o - KB read from storage per operation
* wKB/o - KB written to storage per operation
* o/s - operations/s, QPS, throughput
* dbms - Postgres version

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000780        5.017   0.407   8.937   24.2    34164   15.15
0.000755        5.090   0.409   9.002   24.648  34833   16.11
0.000800        5.114   0.418   9.146   24.626  34195   17.7
0.000939        5.251   0.477   11.186  31.197  27304   18.1

--- relative to first result
0.97            1.01    1.00    1.01    1.02    1.02    16.11
1.03            1.02    1.03    1.02    1.02    1.00    17.7
1.20            1.05    1.17    1.25    1.29    0.80    18.1

I also have results from pg_stat_all_tables collected at the end of the delete and insert tests for Postgres 17.7 and 18.1. Then I computed the difference as (results after insert test - results before insert test). From the results below I don't see a problem. Note that the difference for n_dead_tup is zero. While my analysis was limited to one of the 8 tables used for the benchmark, the values for the other 7 tables are similar.

The columns with a non-zero difference for Postgres 17.7 are:
  3996146   n_tup_ins
  3996146   n_live_tup
  3996146   n_mod_since_analyze
  3996146   n_ins_since_vacuum

The columns with a non-zero difference for Postgres 18.1 are:
  3191278   n_tup_ins
  3191278   n_live_tup
  3191278   n_mod_since_analyze
  3191278   n_ins_since_vacuum
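These snapshots and diffs can be collected with something like the following sketch (my scripts are messier). It assumes a local database and the default sysbench table names (sbtest1 .. sbtest8); here it diffs n_tup_ins per table.

# snapshot pg_stat_all_tables before and after a test, then diff one counter
q="SELECT relname, n_tup_ins FROM pg_stat_all_tables
   WHERE relname LIKE 'sbtest%' ORDER BY relname"
psql -At -c "$q" > before.txt
# ... run the insert test ...
psql -At -c "$q" > after.txt
join -t'|' before.txt after.txt | awk -F'|' '{print $1, $3 - $2}'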

Explaining the other tests for Postgres 16, 17 and 18

Above I listed the order in which the write-heavy tests are run. The regressions occur on the update-zipf, write-only and read-write tests. All of these follow the update-one test. 
  • There is no regression for write-only and read-write when I change the test order so these run prior to update-one and update-zipf
  • There is a regression if either or both of update-one and update-zipf are run prior to write-only and read-write
I assume that either the amount of MVCC debt created by update-one and update-zipf is larger starting with Postgres 16 or that starting in Postgres 16 something changed so that Postgres is less effective at dealing with that MVCC debt.

Note that the update-one and update-zipf tests are a bit awkward. But this workload hasn't been a problem for InnoDB so I assume this is specific to Postgres (and vacuum). For the update-one test all updates are limited to one row per table (the first row in the table). And for the update-zipf test a zipfian distribution is used to select the rows to update. So in both cases a small number of rows receive most of the updates.

Results from pg_stat_all_tables collected immediately prior to update-one and then after update-zipf are here for Postgres 15.15 and 16.11. Then I computed the difference as (results after update-zipf - results before update-one). 

The columns with a non-zero difference for Postgres 15.15 are:
  23747485   idx_scan
  23747485   idx_tup_fetch
  23747485   n_tup_upd
  22225868   n_tup_hot_upd
    -69576   n_live_tup
   3273433   n_dead_tup
  -1428177   n_mod_since_analyze

The columns with a non-zero difference for Postgres 16.11 are:
  23102012   idx_scan
  23102012   idx_tup_fetch
  23102012   n_tup_upd
  21698107   n_tup_hot_upd
   1403905   n_tup_newpage_upd
     -3568   n_live_tup
   2983730   n_dead_tup
  -2064095   n_mod_since_analyze

I also have the vmstat and iostat data for each of the tests. In all cases, for the tests that run after update-one, the amount of CPU and IO per operation increases after Postgres 15.15.

For update-one results don't change much across versions

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000281        12.492  0.001   0.011   0.741   162046  15.15
0.000278        12.554  0.001   0.012   0.746   162415  16.11
0.000253        11.028  0.002   0.029   0.764   174715  17.7
0.000254        10.960  0.001   0.011   0.707   172790  18.1

--- relative to first result
0.99            1.00    1.00    1.09    1.01    1.00    16.11
0.90            0.88    2.00    2.64    1.03    1.08    17.7
0.90            0.88    1.00    1.00    0.95    1.07    18.1

For update-zipf the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000721        6.517   0.716   6.583   18.811  41531   15.15
0.000813        7.354   0.746   7.838   22.746  35497   16.11
0.000971        5.679   0.796   10.492  27.858  29700   17.7
0.000966        5.718   0.838   10.354  28.965  29289   18.1

--- relative to first result
1.13            1.13    1.04    1.19    1.21    0.85    16.11
1.35            0.87    1.11    1.59    1.48    0.72    17.7
1.34            0.88    1.17    1.57    1.54    0.71    18.1

For write-only_range=10000 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000784        5.881   0.799   7.611   26.835  30174   15.15
0.000920        6.315   0.85    10.033  32      26444   16.11
0.001003        5.883   0.871   11.147  33.142  25889   17.7
0.000999        5.905   0.891   10.988  34.443  25232   18.1

--- relative to first result
1.17            1.07    1.06    1.32    1.19    0.88    16.11
1.28            1.00    1.09    1.46    1.24    0.86    17.7
1.27            1.00    1.12    1.44    1.28    0.84    18.1

For read-write_range=100 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001358        7.756   1.758   25.654  22.615  31574   15.15
0.001649        8.111   1.86    30.99   35.413  22264   16.11
0.001629        7.609   1.856   29.901  28.681  25906   17.7
0.001646        7.484   1.978   29.456  29.138  25573   18.1

--- relative to first result
1.21            1.05    1.06    1.21    1.57    0.71    16.11
1.20            0.98    1.06    1.17    1.27    0.82    17.7
1.21            0.96    1.13    1.15    1.29    0.81    18.1

For read-write_range=10 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000871        6.441   1.096   10.238  21.426  37141   15.15
0.001064        6.926   1.162   13.761  29.738  27540   16.11
0.001074        6.461   1.181   14.759  29.338  28839   17.7
0.001045        6.497   1.222   14.726  28.632  30431   18.1

--- relative to first result
1.22            1.08    1.06    1.34    1.39    0.74    16.11
1.23            1.00    1.08    1.44    1.37    0.78    17.7
1.20            1.01    1.11    1.44    1.34    0.82    18.1

Thursday, January 8, 2026

Postgres vs tproc-c on a small server

This is my first post with results from tproc-c using HammerDB. This post has results for Postgres. 

tl;dr - across 8 workloads (low and medium concurrency, cached database to IO-bound)

  • there might be a regression for Postgres 14.20 and 15.15 in one workload
  • there are improvements, some big, for Postgres 17 and 18 in most workloads

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For versions prior to 18, the config files are named conf.diff.cx10a_c8r32; they are as similar as possible across versions and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 the config file is named conf.diff.cx10b_c8r32 and adds io_method='sync' which matches behavior in earlier Postgres versions.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=1, w=100 - 1 virtual user, 100 warehouses
  • vu=6, w=100 - 6 virtual users, 100 warehouses
  • vu=1, w=1000 - 1 virtual user, 1000 warehouses
  • vu=6, w=1000 - 6 virtual users, 1000 warehouses
  • vu=1, w=2000 - 1 virtual user, 2000 warehouses
  • vu=6, w=2000 - 6 virtual users, 2000 warehouses
  • vu=1, w=4000 - 1 virtual user, 4000 warehouses
  • vu=6, w=4000 - 6 virtual users, 4000 warehouses
The benchmark is run by this script which depends on scripts here.
  • stored procedures are enabled
  • partitioning is used when the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 120 minutes

Results

All artifacts from the tests are here. A spreadsheet with the charts and numbers is here.

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow the y-axis starts at 0.9 to improve readability. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for a given version / NOPM for Postgres 12.22)

Results: vu=1, w=100

Summary:
  • no regressions, no improvements
Results: vu=6, w=100

Summary:
  • no regressions, no improvements
Results: vu=1, w=1000

Summary:
  • no regressions, improvements in Postgres 17 and 18
Results: vu=6, w=1000

Summary:
  • no regressions, improvements in Postgres 16, 17 and 18
Results: vu=1, w=2000

Summary:
  • possible regressions in Postgres 14 and 15, improvements in 13, 16, 17 and 18
Results: vu=6, w=2000

Summary:
  • no regressions, improvements in Postgres 13 through 18
Results: vu=1, w=4000

Summary:
  • no regressions, improvements in Postgres 13 through 18
Results: vu=6, w=4000

Summary:
  • no regressions, improvements in Postgres 16 through 18

Wednesday, January 7, 2026

SSDs, power loss protection and fsync latency

This post has results that measure the impact of calling fsync (or fdatasync) per write for files opened with O_DIRECT. My goal is to document the impact of the innodb_flush_method option. 

The primary point of this post is to document the claim:

For an SSD without power loss protection, writes are fast but fsync is slow.

The secondary point of this post is to provide yet another example where context matters when reporting performance problems. This post is motivated by results that look bad when run on a server with slow fsync but look OK otherwise. 

tl;dr

  • for my mini PCs I will switch from the Samsung 990 Pro to the Crucial T500 to get lower fsync latency. Both are nice devices but the T500 is better for my use case.
  • with a consumer SSD writes are fast but fsync is often slow
  • use an enterprise SSD if possible, if not run tests to understand fsync and fdatasync latency

InnoDB, O_DIRECT and O_DIRECT_NO_FSYNC

When innodb_flush_method is set to O_DIRECT there are calls to fsync after each batch of writes. While I don't know the source code like I used to, I browsed it for this blog post and then looked at SHOW GLOBAL STATUS counters. I think that InnoDB does the following when it is set to O_DIRECT: 

  1. Do one large write to the doublewrite buffer, call fsync on that file
  2. Do the batch of in-place (16kb) page writes
  3. Call fsync once per database file that was written by step 2

When set to O_DIRECT_NO_FSYNC the frequency of calls to fsync is greatly reduced and they are only done in cases where important filesystem metadata needs to be updated, such as after extending a file. The reference manual is misleading WRT the following sentence. I don't think that InnoDB ever does an fsync after each write. It can do an fsync after each batch of writes:

O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.

Many years ago it was risky to use O_DIRECT_NO_FSYNC on some filesystems because the feature as implemented (either upstream or in forks) didn't do fsync for cases where it was needed (see the comment about metadata above). I experienced problems from this and I only have myself to blame. But the feature has been enhanced to do the right thing. And if the #whynotpostgres crowd wants to snark about MySQL not caring about data, let's not forget that InnoDB had per-page checksums long before Postgres -- those checksums made web-scale life much easier when using less than stellar hardware.

The following table uses results while running the Insert Benchmark for InnoDB to compute the ratio of fsyncs per write using the SHOW GLOBAL STATUS counters:
Innodb_data_fsyncs / Innodb_data_writes

And from this table a few things are clear. First, there isn't an fsync per write with O_DIRECT but there might be an fsync per batch of writes as explained above. Second, the rate of fsyncs is greatly reduced by using O_DIRECT_NO_FSYNC. 

5.7.44  8.0.44
.01046  .00729  O_DIRECT
.00172  .00053  O_DIRECT_NO_FSYNC
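A minimal sketch of how that ratio can be computed from the counters, assuming a local server reachable by the mysql CLI (this is not my benchmark tooling):

# compute the fsync-per-write ratio from SHOW GLOBAL STATUS
mysql -Be "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Innodb_data_fsyncs','Innodb_data_writes')" |
awk 'NR > 1 { v[$1] = $2 }
     END    { printf "fsyncs/write = %.5f\n",
                     v["Innodb_data_fsyncs"] / v["Innodb_data_writes"] }'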

Power loss protection

I am far from an expert on this topic, but most SSDs have a write buffer that makes small writes fast. And one way to achieve speed is to buffer those writes in RAM on the SSD while waiting for enough data to be written to an extent. But that speed means there is a risk of data loss if a server loses power. Some SSDs, especially those marketed as enterprise SSDs, have a feature called power loss protection that makes data loss unlikely. Other SSDs, let's call them consumer SSDs, don't have that feature, though some of them claim to make a best effort to flush writes from the write buffer on power loss.

One way to avoid the risk is to only buy enterprise SSDs. But they are more expensive, less common, and many are larger (22110 rather than 2280) because more room is needed for the capacitor or other HW that provides the power loss protection. Note that power loss protection is often abbreviated as PLP.

For devices without power loss protection it is often true that writes are fast but fsync is slow. When fsync is slow then calling fsync more frequently in InnoDB will hurt performance.

Results from fio

I used this fio script to measure write performance for files opened with O_DIRECT. The test was run twice per configuration, for 5 minutes per run followed by a 5 minute sleep. This was repeated for 1, 2, 4, 8, 16 and 32 fio jobs but I only share results here for 1 job. The configurations tested were (a rough fio sketch follows the list):

  • O_DIRECT without fsync, 16kb writes
  • O_DIRECT with an fsync per write, 16kb writes
  • O_DIRECT with an fdatasync per write, 16kb writes
  • O_DIRECT without fsync, 2M writes
  • O_DIRECT with an fsync per write, 2M writes
  • O_DIRECT with an fdatasync per write, 2M writes
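This is a rough sketch of one such invocation, not my script; the file name, size and runtime are illustrative:

# 16kb O_DIRECT writes with an fdatasync per write, 1 job, 5 minutes.
# Drop --fdatasync=1 for the no-sync case, use --fsync=1 for an fsync
# per write, and use --bs=2m for the 2M write tests.
fio --name=dsync-16k --filename=/data/fio.test --size=10G \
    --ioengine=psync --direct=1 --rw=randwrite --bs=16k \
    --fdatasync=1 --numjobs=1 --time_based --runtime=300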
Results from all tests are here. I did the test on several servers:
  • dell32
    • a large server I have at home. The SSD is a Crucial T500 2TB using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD. While the web claims it has PLP via capacitors, the fsync latency for it was almost 1 millisecond.
  • gcp
    • a c3d-standard-30-lssd from the Google cloud with 2 local NVMe devices using SW RAID 0 and 1TB of Hyperdisk Balanced storage configured for 50,000 IOPs and 800MB/s of throughput. The OS is Ubuntu 24.04 and I repeated tests for both ext-4 and xfs, both with discard enabled. I was not able to determine the brand of the local NVMe devices.
  • hetz
    • an ax162-s from Hetzner with 2 local NVMe devices using SW RAID 1. Via udisksctl status I learned the devices are Intel D7-P5520 (now Solidigm). These are datacenter SSDs and the web claims they have power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 without discard enabled. 
  • ser7
    • a mini PC I have at home. The SSD is a Samsung 990 Pro, a consumer SSD without power loss protection.
  • socket2
    • a 2-socket server I have at home. The SSD is a Samsung PM-9a3. This is an enterprise SSD with power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 with discard enabled.
Results: overview

All of the results are here.

This table lists fsync and fdatasync latency per server:
  • for servers with consumer SSDs (dell, ser7) the latency is much larger on the ser7 that uses a Samsung 990 Pro than on the dell that uses a Crucial T500. This is to be expected given that the T500 has PLP while the 990 Pro does not.
  • sync latency is much lower on servers with enterprise SSDs
  • sync latency after 2M writes is sometimes much larger than after 16kb writes
  • for the Google server with Hyperdisk Balanced storage the fdatasync latency was good but fsync latency was high. With the local NVMe devices the latencies were larger than for enterprise SSDs but much smaller than for consumer SSDs.

--- Sync latency in microseconds for sync after 16kb writes

dell    hetz    ser7    socket2
891.1   12.4    2974.2  1.6     fsync
447.4    9.8    2783.2  0.7     fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
56.2    39.5            738.1   635.0   fsync
28.1    29.0             46.8    46.0   fdatasync

--- Sync latency in microseconds for sync after 2M writes

dell    hetz    ser7    socket2
980.1   58.2    5396.8  139.1   fsync
449.7   10.8    3508.2    2.2   fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
1020.4  916.8           821.2   778.9   fsync
 832.4  809.7            63.6    51.2   fdatasync

Results: dell

Summary:
  • Write throughput drops dramatically when there is an fsync or fdatasync per write because sync latency is large.
  • This server uses a consumer SSD so high sync latency is expected
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes
w/s     MB/s    sync    test
43400   646.6   0.0     no-sync
43500   648.5   0.0     no-sync
-
1083    16.1    891.1   fsync
1085    16.2    889.2   fsync
-
2100    31.3    447.4   fdatasync
2095    31.2    448.6   fdatasync

2 MB writes
w/s     MB/s    sync    test
2617    4992.5  0.0     no-sync
2360    4502.3  0.0     no-sync
-
727     1388.5  980.1   fsync
753     1436.2  942.5   fsync
-
1204    2297.4  449.7   fdatasync
1208    2306.0  446.9   fdatasync

Results: gcp

Summary
  • Local NVMe devices have lower sync latency and more throughput with and without a sync per write at low concurrency (1 fio job).
  • At higher concurrency (32 fio jobs), the Hyperdisk Balanced setup provides similar throughput to local NVMe and would do even better had I paid more to get more IOPs and throughput. Results don't have nice formatting but are here for xfs on the local and Hyperdisk Balanced devices.
  • fsync latency is ~2X larger than fdatasync on the local devices and closer to 15X larger on the Hyperdisk Balanced setup. That difference is interesting. I wonder what the results are for Hyperdisk Extreme.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
--- ext-4 and local devices

16 KB writes
w/s     MB/s    sync    test
10100   150.7   0.0     no-sync
10300   153.5   0.0     no-sync
-
6555    97.3    56.2    fsync
6607    98.2    55.1    fsync
-
8189    122.1   28.1    fdatasync
8157    121.1   28.2    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
388     741.0   1020.4  fsync
388     741.0   1012.7  fsync
-
390     744.8   832.4   fdatasync
390     744.8   869.6   fdatasync

--- xfs and local devices

16 KB writes
w/s     MB/s    sync    test
9866    146.9   0.0     no-sync
9730    145.0   0.0     no-sync
-
7421    110.6   39.5    fsync
7537    112.5   38.3    fsync
-
8100    121.1   29.0    fdatasync
8117    121.1   28.8    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
389     743.9   916.8   fsync
389     743.9   919.1   fsync
-
390     744.8   809.7   fdatasync
390     744.8   806.5   fdatasync

--- ext-4 and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2093    31.2    0.0     no-sync
2068    30.8    0.0     no-sync
-
804     12.0    738.1   fsync
798     11.9    740.6   fsync
-
1963    29.3    46.8    fdatasync
1922    28.6    49.0    fdatasync

2 MB writes
w/s     MB/s    sync    test
348     663.8   0.0     no-sync
367     701.0   0.0     no-sync
-
278     531.2   821.2   fsync
271     517.8   814.1   fsync
-
358     683.8   63.6    fdatasync
345     659.0   64.5    fdatasync

--- xfs and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2033    30.3    0.0     no-sync
2004    29.9    0.0     no-sync
-
870     13.0    635.0   fsync
858     12.8    645.0   fsync
-
1787    26.6    46.0    fdatasync
1727    25.7    49.6    fdatasync

2 MB writes
w/s     MB/s    sync    test
343     655.2   0.0     no-sync
343     655.2   0.0     no-sync
-
267     511.2   778.9   fsync
268     511.2   774.7   fsync
-
347     661.8   51.2    fdatasync
336     642.8   54.4    fdatasync

Results: hetz

Summary
  • this has an enterprise SSD with excellent (low) sync latency
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
37700   561.7   0.0     no-sync
37500   558.9   0.0     no-sync
-
25200   374.8   12.4    fsync
25100   374.8   12.4    fsync
-
27600   411.0   0.0     fdatasync
27200   404.4   9.8     fdatasync

2 MB writes
w/s     MB/s    sync    test
1833    3497.1  0.0     no-sync
1922    3667.8  0.0     no-sync
-
1393    2656.9  58.2    fsync
1355    2585.4  59.6    fsync
-
1892    3610.6  10.8    fdatasync
1922    3665.9  10.8    fdatasync

Results: ser7

Summary:
  • this has a consumer SSD with high sync latency
  • results had much variance (see the 2MB results below), as did results at higher concurrency. This is a great SSD, but not for my use case.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
34000   506.4   0.0     no-sync
40200   598.9   0.0     no-sync
-
325     5.0     2974.2  fsync
333     5.1     2867.3  fsync
-
331     5.1     2783.2  fdatasync
330     5.0     2796.1  fdatasync

2 MB writes
w/s     MB/s    sync    test
362     691.4   0.0     no-sync
364     695.2   0.0     no-sync
-
67      128.7   10828.3 fsync
114     218.4   5396.8  fsync
-
141     268.9   3864.0  fdatasync
192     368.1   3508.2  fdatasync

Results: socket2

Summary:
  • this has an enterprise SSD with excellent (low) sync latency after small writes, but fsync latency after 2MB writes is much larger
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
49500   737.2   0.0     no-sync
49300   734.3   0.0     no-sync
-
44500   662.8   1.6     fsync
45400   676.2   1.5     fsync
-
46700   696.2   0.7     fdatasync
45200   674.2   0.7     fdatasync

2 MB writes
w/s     MB/s    sync    test
707     1350.4  0.0     no-sync
708     1350.4  0.0     no-sync
-
703     1342.8  139.1   fsync
703     1342.8  122.5   fsync
-
707     1350.4  2.2     fdatasync
707     1350.4  2.1     fdatasync


Friday, January 2, 2026

Common prefix skipping, adaptive sort

The patent expired for US7680791B2. I invented this while at Oracle and it landed in 10gR2 with claims of ~5X better performance vs the previous sort algorithm used by Oracle. I hope for an open-source implementation one day. The patent has a good description of the algorithm, it is much easier to read than your typical patent. Thankfully the IP lawyer made good use of the functional and design docs that I wrote.

The patent is for a new in-memory sort algorithm that needs a name. Features include:

  • common prefix skipping
    • skips comparing the common prefix of key bytes when possible
  • adaptive
    • switches between quicksort and most-significant digit radix sort
  • key substring caching
    • reduces CPU cache misses by caching the next few bytes of the key
  • produces results before sort is done
    • sorted output can be produced (to the rest of the query, or spilled to disk for an external sort) before the sort is finished. 
Update:
  • the sort algorithm needs a name and common prefix skipping adaptive quicksort is much too long. So I suggest Orasort.

How it came to be

From 2000 to 2005 I worked on query processing for Oracle. I am not sure why I started on this effort and it wasn't suggested by my bosses or peers. But the Sort Benchmark contest was active and I had more time to read technical papers. Perhaps I was inspired by the Alphasort paper.

While the Sort Benchmark advanced the state of the art in sort algorithms, it also encouraged algorithms that were great for benchmarks (focus on short keys with uniform distribution). But keys sorted by a DBMS are often much larger than 8 bytes and adjacent rows often have long common prefixes in their keys.

So I thought about this while falling asleep and after many nights realized that with a divide and conquer sort, as the algorithm descends into subpartitions of the data, the common prefixes of the keys in each subpartition are likely to grow:

  • were the algorithm able to remember the length of the common prefix as it descends then it can skip the common prefix during comparisons to save on CPU overhead
  • were the algorithm able to learn when the length of the common prefix grows then it can switch from quicksort to most-significant digit (MSD) radix sort using the next byte beyond the common prefix and then switch back to quicksort after doing that
  • the algorithm can cache bytes from the key in an array, like Alphasort. But unlike Alphasort as it descends it can cache the next few bytes it will need to compare rather than only caching the first few bytes of the key. This provides much better memory system behavior (fewer cache misses).
Early implementation

This might have been in 2003 before we were able to access work computers from home. I needed to get results that would convince management this was worth doing. I started my proof-of-concept on an old PowerPC based Mac I had at home that found a second life after I installed Yellow Dog Linux on it.

After some iteration I had good results on the PowerPC. So I brought my source code into work and repeated the test on other CPUs that I could find. On my desk I had a Sun workstation and a Windows PC with a 6 year old Pentium 3 CPU (600MHz, 128kb L2 cache). Elsewhere I had access to a new Sun server with a 900MHz UltraSPARC IV (or IV+) CPU and an HP server with a PA RISC CPU.

I also implemented other state of the art algorithms including Alphasort along with the old sort algorithm used by Oracle. From testing I learned:
  1. my new sort was much faster than other algorithms when keys were larger than 8 bytes
  2. my new sort was faster on my old Pentium 3 CPU than on the Sun UltraSPARC IV
The first was great news for me, the second was less than great news for Sun shareholders. I never learned why that UltraSPARC IV performance was lousy. It might have been latency to the caches.

Real implementation

Once I had great results, it was time for the functional and design specification reviews. I remember two issues:
  • the old sort was stable, the new sort was not
    • I don't remember how this concern was addressed
  • the new sort has a bad, but unlikely, worst-case
    • The problem here is the worst-case when quicksort picks the worst pivot every time it selects a pivot. The new sort wasn't naive, it used the median from a sample of keys each time to select a pivot (the sample size might have been 5). So I did the math to estimate the risk. Given that the numbers are big and probabilities are small I needed a library or tool that supported arbitrary-precision arithmetic and ended up using a Scheme implementation. The speedup in most cases justified the risk in a few cases.
And once I had this implemented within the Oracle DBMS I was able to compare it with the old sort. The new sort was often about 5 times faster than the old sort. I then compared it with SyncSort. I don't remember whether they had a DeWitt Clause so I won't share the results but I will say that the new sort in Oracle looked great in comparison.

The End

The new sort landed in 10gR2 and was featured in a white-paper. I also got a short email from Larry Ellison thanking me for the work. A promotion or bonus would have to wait as you had to play the long-game in your career at Oracle. And that was all the motivation I needed to leave Oracle -- first for a startup, and then to Google and Facebook.

After leaving Oracle, much of my time was spent on making MySQL better. Great open-source DBMS, like MySQL and PostgreSQL, were not good for Oracle's new license revenue. Oracle is a better DBMS, but not everyone needs it or can afford it.

Tuesday, December 30, 2025

Performance for RocksDB 9.8 through 10.10 on 8-core and 48-core servers

This post has results for RocksDB performance using db_bench on 8-core and 48-core servers. I previously shared results for RocksDB performance using gcc and clang and then for RocksDB on a small Arm server.

tl;dr

  • RocksDB is boring, there are few performance regressions. 
  • There was a regression in write-heavy workloads with RocksDB 10.6.2. See bug 13996 for details. That has been fixed.
  • I will repeat tests in a few weeks

Software

I used RocksDB versions 9.8 through 10.10.

I compiled each version with clang version 18.3.1 with link-time optimization (LTO) enabled. The build command line was:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

Hardware

I used servers with 8 and 48 cores, both run Ubuntu 22.04:

  • 8-core
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM.
    • storage is one NVMe SSD with discard enabled and ext-4
    • benchmarks are run with 1 client, 20M KV pairs for byrx and 400M KV pairs for iobuf and iodir
  • 48-core
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM
    • storage is 2 SSDs with RAID 1 (3.8T each) and ext-4.
    • benchmarks are run with 36 clients, 200M KV pairs for byrx and 2B KV pairs for iobuf and iodir

Benchmark

Overviews on how I use db_bench are here and here.

Most benchmark steps were run for 1800 seconds and all used the LRU block cache. I try to use Hyperclock on large servers but forgot that this time.

Tests were run for three workloads:

  • byrx - database cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB used buffered IO
  • iodir - database is larger than RAM and RocksDB used O_DIRECT

The benchmark steps that I focus on are below (a rough db_bench sketch follows the list):
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys and wait for compaction to stop at test end
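This is a rough sketch of the fillseq and overwrite steps, not my wrapper scripts; the path, key count and options are illustrative:

# load in key order, then overwrite random keys for 1800 seconds
./db_bench --db=/data/rocks --num=20000000 --benchmarks=fillseq \
    --compression_type=none
./db_bench --db=/data/rocks --use_existing_db=1 --num=20000000 \
    --benchmarks=overwrite --duration=1800 --threads=1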

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 9.8)

The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than RocksDB 9.8. When it is < 1.0 then there might be a performance regression or there might just be noise.

The spreadsheet with numbers and charts is here. Performance summaries are here.

Results: cached database (byrx)

From 1 client on the 8-core server

  • Results are stable except for the overwrite test where there might be a regression, but after repeating this test 2 more times I think that is noise and the cause is that the base case (the result from 9.8) was an outlier. I will revisit this.

From 36 clients on the 48-core server

  • Results are stable

Results: IO-bound with buffered IO (iobuf)

From 1 client on the 8-core server

  • Results are stable except for the overwrite test where there might be a large improvement. But I wonder if this is from noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
  • The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server
  • Results are stable except for the overwrite test where there might be a large improvement. But I wonder if this is from noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
  • The regression in fillseq with 10.6.2 is from bug 13996
Results: IO-bound with O_DIRECT (iodir)

From 1 client on the 8-core server

  • Results are stable
  • The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server

  • Results are stable
  • The regression in fillseq with 10.6.2 is from bug 13996


Monday, December 29, 2025

IO-bound sysbench vs Postgres on a 48-core server

This post has results for an IO-bound sysbench benchmark on a 48-core server for Postgres versions 12 through 18. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr - for Postgres 18.1 relative to 12.22

  • QPS for IO-bound point-query tests is similar while there is a large improvement for the one CPU-bound test (hot-points)
  • QPS for range queries without aggregation is similar
  • QPS for range queries with aggregation is between 1.05X and 1.25X larger in 18.1
  • QPS for writes show there might be a few large regressions in 18.1
tl;dr - for Postgres 18.1 using different values for the io_method option
  • for tests that do long range queries without aggregation
    • the best QPS is from io_method=io_uring
    • the second best QPS is from io_method=worker with a large value for io_workers
  • for tests that do long range queries with aggregation
    • when using io_method=worker a larger value for io_workers hurt QPS in contrast to the result for range queries without aggregation
    • for most tests the best QPS is from io_method=io_uring

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.23, 14.20, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Configuration files for the big server
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most tests use only one type of SQL statement. Benchmarks are run with a database that is larger than memory.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.
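A rough sketch of the load step and of one microbenchmark at this scale, not my scripts (connection options omitted and test names illustrative):

# 8 tables with 250M rows each, then one read-heavy microbenchmark
# with 40 clients for 600 seconds
sysbench oltp_read_only --db-driver=pgsql --tables=8 \
    --table-size=250000000 --threads=8 prepare
sysbench oltp_point_select --db-driver=pgsql --tables=8 \
    --table-size=250000000 --threads=40 --time=600 run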

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than base version.

I provide two comparisons and each uses a different base version. They are:
  • base version is Postgres 12.22
    • compare 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1
    • the goal for this is to see how performance changes over time
    • per-test results from vmstat and iostat are here
  • base version is Postgres 18.1
    • compare 18.1 using the x10b_c32r128, x10c_c32r128, x10cw8_c32r128, x10cw16_c32r128, x10cw32_c32r128 and x10d_c32r128 configs
    • the goal for this is to understand the impact of the io_method option (a minimal sketch of the relevant settings follows this list)
    • per-test results from vmstat and iostat are here
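A minimal sketch of the io_method-related settings implied by those config names; I am assuming that x10cw16_c32r128 means io_method=worker with 16 IO workers and that x10d_c32r128 means io_uring (the actual config files are linked above):

# settings that differ between the configs (sketch)
io_method = 'worker'
io_workers = 16
# or, for the x10d config:
# io_method = 'io_uring'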
The per-test results from vmstat and iostat can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: Postgres 12.22 through 18.1

All charts except the first have the y-axis start at 0.7 rather than 0.0 to improve readability.

There are two charts for point queries. The second truncates the y-axis to improve readability.
  • a large improvement for the hot-points test arrives in 17.x. While most tests are IO-bound, this test is CPU-bound because all queries fetch the same N rows.
  • for other tests there are small changes, both improvements and regressions, and the regressions are too small to investigate
For range queries without aggregation:
  • QPS for Postgres 18.1 is within 5% of 12.22, sometimes better and sometimes worse
  • for Postgres 17.7 there might be a large regression on the scan test and that also occurs with 17.6 (not shown). But the scan test can be prone to variance, especially with Postgres, and I don't expect to spend time debugging this. Note that the config I use for 18.1 here uses io_method=sync which is similar to what Postgres uses in releases prior to 18.x. From the vmstat and iostat metrics what I see is:
    • a small reduction in CPU overhead (cpu/o) in 18.1
    • a large reduction in the context switch rate (cs/o) in 18.1
    • small reductions in read IO (r/o and rKB/o) in 18.1
For range queries with aggregation:
  • QPS for 18.1 is between 1.05X and 1.25X better than for 12.22
For write-heavy tests
  • there might be large regressions for several tests: read-write, update-zipf and write-only. The read-write tests do all of the writes done by write-only and then add read-only statements. 
  • from the vmstat and iostat results for the read-write tests I see
    • CPU (cpu/o) is up by ~1.2X in PG 16.x through 18.x
    • storage reads per query (r/o) have been increasing from PG 16.x through 18.x and are up by ~1.1X in PG 18.1
    • storage KB read per query (rKB/o) increased starting in PG 16.x and is 1.44X and 1.16X larger in PG 18.x
  • from the vmstat and iostat results for the update-zipf test
    • results are similar to the read-write tests above
  • from the vmstat and iostat results for the write-only test
    • results are similar to the read-write tests above
Results: Postgres 18.1 and io_method

For point queries
  • results are similar for all configurations and this is expected
For range queries without aggregation
  • there are two charts, the y-axis is truncated in the second to improve readability
  • all configs get similar QPS for all tests except scan
  • for the scan test
    • the x10c_c32r128 config has the worst result. This is expected given there are 40 concurrent connections and it used the default for io_workers (=3)
    • QPS improves for io_method=worker with larger values for io_workers
    • io_method=io_uring has the best QPS (the x10d_c32r128 config)
For range queries with aggregation
  • when using io_method=worker a larger value for io_workers hurt QPS in contrast to the result for range queries without aggregation
  • io_method=io_uring gets the best QPS on all tests except for the read-only tests with range=10 and 10,000. There isn't an obvious problem based on the vmstat and iostat results. From the r_await column in iostat output (not shown) the differences are mostly explained by a change in IO latency. Perhaps variance in storage latency is the issue.
For writes
  • the best QPS occurs with the x10b_c32r128 config (io_method=sync). I am not sure if that option matters here and perhaps there is too much noise in the results.
