Removing fsync from our local storage engine (fractalbits.com)
bradfa 21 hours ago [-]
There’s lies, damn lies, and lies that disks tell the operating system. Don’t believe any of them!

If you need to know it’s been persisted to non-volatile storage then you need to own the full stack of every piece of software between the OS and the actual physical memory.

Every managed flash drive is going to have layers and layers of complexity and caching and things you simply can’t easily control or really understand. Don’t trust it unless you know exactly how it works all the way down.

thomas_fa 19 hours ago [-]
Well said, and there are some bitter lessons in the storage industry.

At my last company we had to disable the disk write cache on every reboot, and we also heard plenty of industry stories about what the firmware does underneath from the Oxide Computer podcast [1]. Yes, to provide a truly reliable service, you have to evaluate the underlying hardware settings case by case.

[1] https://onthemetal.transistor.fm/

nh2 23 hours ago [-]
> fsync doesn’t just sync the file’s data, it syncs every piece of metadata the file depends on: ... directory entry

Famously not, as the man page says.

It is also said later in the article:

> POSIX strictly requires a parent-directory fsync to make a newly created file’s existence durable.

So I'm not sure why the dirent sync is claimed earlier.

thomas_fa 21 hours ago [-]
Thanks for pointing out the mistake. We should make it clearer: fsync on an open file descriptor only syncs that file's own metadata. To make a newly created file truly persistent, you need to issue another fsync on the directory fd, which makes it more expensive.
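A minimal sketch of that two-fsync pattern on Linux (an illustrative helper, not code from our engine):

```python
import os

def create_durably(dir_path: str, name: str, data: bytes) -> None:
    """Create a file and make both its contents and its directory entry durable."""
    path = os.path.join(dir_path, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)        # syncs the file's data and its own metadata...
    finally:
        os.close(fd)
    # ...but not the new directory entry; that needs a second fsync on the dir fd.
    dfd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

That second open/fsync pair is exactly the extra cost on every file creation.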
dezgeg 21 hours ago [-]
You don't need to do that for every write though. Only when the database file is created.
thomas_fa 21 hours ago [-]
Yes. In particular, for our object storage every putObject needs to create a new entry in the (data) namespace, which would need an fsync of the directory fd.
matja 22 hours ago [-]
Even with O_DIRECT and aligned blocks, I still don't understand how the storage engine can return a "successful commit" to the client without a sync at some point, because a sync (IIRC) is the only way to guarantee an ATA/NVMe FUA command is sent, and the device write cache/buffer is committed.
klodolph 22 hours ago [-]
:-/ it’s a statistical guarantee in the first place. A successful commit in a durable storage engine just needs to achieve some finite level of durability, like “10^-7 probability of loss per year”. The durability is a property of the whole system, and it is possible to achieve durability without fsync, you just may have a hard time explaining what the durability is, how you calculated it, and what the evidence or justifications are for the numbers you give.

Even if you just look at hardware failure rates, you get unrecoverable I/O errors (data corruption) at about one in 10^15 bits, disk failures at a rate of about 1% per year, etc. People usually like to have better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to do an analysis of the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.

asdfasgasdgasdg 21 hours ago [-]
10^-7 (losses/record) * 10^8 (records/year) yields 10 data losses per year. If you're even a medium-sized business, you need a much better than 10^-7 probability of loss.
klodolph 10 hours ago [-]
The half-remembered storage system I pulled those numbers from had records ~100G in size, so a 10^-7 loss is 1 loss event per year, per exabyte of data. A loss event is just “at least one bit in the record cannot be read within a certain deadline”.
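Spelling out that arithmetic, with the record size and loss rate as stated above:

```python
record_size = 100 * 10**9           # ~100 GB per record, as above
fleet_bytes = 10**18                # one exabyte of data
records = fleet_bytes / record_size             # 1e7 records
annual_loss_per_record = 1e-7
loss_events_per_year = records * annual_loss_per_record
print(loss_events_per_year)         # about 1 loss event per year
```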

Durability is a knob. If you have enough data, or turn the knob too far in the direction of durability, you will simply bankrupt yourself or maybe drown your service in latency. It makes sense that you would have storage services that provide different levels of durability.

Dylan16807 20 hours ago [-]
That's only true if your typical loss event loses one record. If you have a one in a million chance of an array failure taking out 10% of your production database, and otherwise have zero possibility of data loss, you also get 10^-7 losses per record.

And I wouldn't assume they meant that number to be per record in the first place.

asdfasgasdgasdg 20 hours ago [-]
I don't think anyone in history has ever achieved a true 10^-7 annual probability of any data loss incident. So they must have been making some kind of per record or per operation claim.
klodolph 9 hours ago [-]
I like to think that the true AFR for data is bounded by something like 10^-3, because maybe that’s close to the rate at which civilizations collapse. You have to use a kind of subtle definition to support 10^-7 or 10^-9 or 10^-11. Or maybe instead of “subtle definition”, you can call it a “whimsical, imaginary definition”. Depends on how cynical you are.

The way I would go is by saying that you multiply the number of objects by AFR, and that’s close to the actual losses on most years. You can then exclude WW3 and the late holocene extinction event from your consideration. Or simple bankruptcy, for that matter. If your employer is gone, you don’t care about its data any more.

jakewins 21 hours ago [-]
I used to say this as well, but the industry has, for a long time now, equated "durable" with "stored on disk". Any DBA will assume that's what it means, and will rely on that fact when working out the replication they need, whether in clustering or in RAID.

If you’re building a data storage system and are using the term “durable” to mean “it’s in RAM on three virtual machines”, for example, I don’t think it’s unfair to say that you are lying to your customers, because you are intentionally misusing a well-established term.

zbentley 12 hours ago [-]
I forget the product, but more than a decade ago I remember someone broke out their durability into a table with columns for all the settings their data store offered between “ram on one node” and “fsync confirmed on a quorum of nodes’ disks” and rows for example failure cases ranging from “unexpected reboot of one machine” to “catastrophic loss of quorum-1 machines”. Cells were data loss risks from “prevented” to “possible” to “likely”.

That was very helpful when choosing durability levels.

klodolph 10 hours ago [-]
I don’t have any respect for the viewpoint that “durable” is equatable with “stored on disk”, and I don’t want to spend time accommodating that viewpoint. It is just an oversimplification in a very bad way.

AFRs and discussions about different failure scenarios are the bare minimum. The bare minimum for scenarios is disk loss, total machine loss, and data center loss. This is just my take on things. I don’t care if something is on disk or not. I do care what happens when a sector on disk goes bad, when a faulty power supply destroys all the disks in a machine, or when a data center floods.

That forces you to think about things like whether you want to turn on synchronous replication.

thomas_fa 21 hours ago [-]
Yes, as we mentioned in the post, this is targeted at virtualized NVMe disks, where we have no control over actually issuing the FUA command. We are also changing to open data files with O_DATA_SYNC to make them work in normal on-prem deployment environments.
nh2 20 hours ago [-]
Even then, I also share the confusion of the poster you're replying to.

I don't see how a virtualised NVMe disk is different from a physical one.

Especially if you don't have control over the underlying hardware (so you don't know if it has power-loss-protection PLP SSDs), you should send the FUA.

> O_DATA_SYNC

You mean `O_DSYNC`?

Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my above point, surely it is the task of the VM to pass through any FUA commands the VM guest issues to the actual storage?

Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?

My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you drop `fdatasync()` to avoid the "can trigger an order of magnitude more I/O" effect, you would just re-introduce that I/O with `O_DSYNC`.
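A minimal sketch of the two paths being compared (illustrative, plain buffered I/O; `os.O_DSYNC` assumed available, i.e. Linux):

```python
import os

def write_with_fdatasync(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)   # explicit sync syscall after the write
    finally:
        os.close(fd)

def write_with_odsync(path: str, data: bytes) -> None:
    # O_DSYNC: each write() returns only once the data (not all metadata)
    # has reached the device -- same effect, one syscall fewer.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```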

However, I suspect that that whole consideration is pointless:

The only thing that makes your O_DIRECT + preallocated-only-overwrites writes safe is enterprise SSDs with Power Loss Protection (PLP), usually capacitors.

On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.

    [1] https://news.ycombinator.com/item?id=46532675
    [2] https://tanelpoder.com/posts/using-pg-test-fsync-for-testing-low-latency-writes/
So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being about 6 us.

Let me know if I got anything wrong.

The only remaining question is: Why do you then see any difference in your benchmark?

    Configuration            Throughput (obj/s)
    -------------------------------------------
    ext4 + O_DIRECT + fsync             116,041
    Our engine                          190,985
That is what I'd find very valuable to investigate.

The first suspicion I have is: Shouldn't you be measuring `+ fdatasync` instead?

So I'd be interested in:

    ext4 + O_DIRECT + fdatasync
    ext4 + O_DIRECT + O_DSYNC
    Our engine + O_DSYNC (which you're suggesting above)
Also I don't fully understand what the remaining difference between "ext4 + O_DIRECT + O_DSYNC" and "Our engine + O_DSYNC" would be.
jmalicki 3 hours ago [-]
Fsync on PLP drives isn't strictly a NOP - you still take a latency hit from the round trip of the command to the NVMe device, where it is implemented as a NOP.
thomas_fa 19 hours ago [-]
Thanks for the feedback. I have already replied in another thread about O_DSYNC, which a lot of folks suggested, so I won't repeat it here.

As for the benchmark results: they were mainly due to metadata management. We implemented our own KV store (see its internals here [1]), which is more efficient than ext4's namespace management, even after very aggressive fs tuning [2] (plus 65536-way sharding for each leveled dir).

[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...

[2] https://github.com/fractalbits-labs/fractalbits/commit/12109...

binaryturtle 22 hours ago [-]
To truly guarantee things you probably also would need an uncached read afterwards (to verify the data comes back properly from the device). Now that would kill any sort of performance, of course.
asdfasgasdgasdg 21 hours ago [-]
There is no such thing as a guarantee in life, there are only probabilities. The goal is to make it sufficiently unlikely that data is lost, and to balance that against the cost of doing so.

That is where the disparity lies here. Reading back the data after the device reports that it has been written offers little in the way of additional assurances that it's successfully written. But if you report successful writes without syncing, there is a near certainty that you'll lose data on every power loss.

mightyham 21 hours ago [-]
Unless I am mistaken, it seems like there is a glaring flaw in this scheme, which is that without fsync you cannot guarantee the previous WAL blocks have been persisted before the current one, so a power loss event could leave a hole in the log and cause erroneous recovery. I believe that SSDs reorder writes internally so even having atomic batched O_DIRECT is not a strong enough guarantee for durability. I'll admit that I could be misunderstanding something about the system that alleviates this concern.
hedora 20 hours ago [-]
Assuming O_DIRECT actually blocks until the SSD has acked (this isn't actually what O_DIRECT's contract says, but what they rely on), you have to wait until each page write acks whenever you need a persistence barrier.

My guess is the preallocation + zeroing is what got them most of the win, and the O_DIRECT is actually hurting, not helping throughput. This has been the case 100% of the time I've benchmarked such things.

If you're doing this sort of stuff for real under Linux, check out sync_file_range. It's the only non-broken and performant sync API for ext4 (note that it's broken by design for many other file systems, and the API is terribly difficult to use correctly).
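Python's os module has no wrapper for sync_file_range, but a rough ctypes sketch (Linux-only; flag values from the man page; note it flushes only that range of the page cache, not the device write cache or file metadata) looks like:

```python
import ctypes
import os

# Linux-only flag values, from sync_file_range(2)
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

_libc = ctypes.CDLL(None, use_errno=True)

def sync_file_range(fd: int, offset: int, nbytes: int) -> None:
    """Write out [offset, offset+nbytes) of fd and wait for completion.
    Does NOT flush the device write cache or the file's metadata."""
    flags = (SYNC_FILE_RANGE_WAIT_BEFORE
             | SYNC_FILE_RANGE_WRITE
             | SYNC_FILE_RANGE_WAIT_AFTER)
    ret = _libc.sync_file_range(fd, ctypes.c_int64(offset),
                                ctypes.c_int64(nbytes), flags)
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The "broken by design" caveat stands: on filesystems with delayed block allocation, a completed sync_file_range still doesn't guarantee the data is reachable after a crash.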

If you really care, it's probably just easier to use SPDK or something. Linux has historically been pretty hostile towards DBMS implementations.

thomas_fa 20 hours ago [-]
That's a lot of valuable information, thanks for the input. Yes, the original blog post mainly focuses on reducing the metadata overhead due to fsync(). I got a lot of good feedback here, and much of the discussion goes beyond our original scenario settings. We would like to incorporate these enhancement suggestions without re-introducing fsync(), and make it work in more general environments.
jandrewrogers 20 hours ago [-]
Many storage devices guarantee that all successful DMA (e.g. O_DIRECT) writes are persisted even in the event of a power loss. This does not work on storage devices that do not offer this guarantee obviously. It also does not work if the filesystem does not support direct I/O or requires metadata updates.

This is not a new trick. It has been used in many storage engine designs to effect durability without an fsync.

mightyham 20 hours ago [-]
Thanks, that's interesting and I wasn't aware of that. Is there a consistent way to determine whether a device offers this guarantee at runtime on Linux?
zzsheng 15 hours ago [-]
Thanks for the feedback. Actually, as pointed out in the blog, we do not use an append-only log, precisely to avoid the fsync that a size change would require. What we use is a preallocated, fixed-size log file: we write journal data and reclaim space in 4KB units, also with direct I/O.
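A rough sketch of that pattern (illustrative Python, without the O_DIRECT alignment and buffer details):

```python
import os

BLOCK = 4096

def preallocate(path: str, nblocks: int) -> None:
    # Write real zeros so the filesystem allocates every extent up front;
    # later in-place overwrites then change no file size or block mapping.
    with open(path, "wb") as f:
        f.write(b"\x00" * (BLOCK * nblocks))
        f.flush()
        os.fsync(f.fileno())   # one-time cost at setup, not per write

def write_block(path: str, index: int, payload: bytes) -> None:
    assert len(payload) <= BLOCK
    record = payload.ljust(BLOCK, b"\x00")
    fd = os.open(path, os.O_WRONLY)   # a real engine would add os.O_DIRECT
    try:
        os.pwrite(fd, record, index * BLOCK)  # aligned in-place overwrite
    finally:
        os.close(fd)
```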
seebeen 21 hours ago [-]
I also asked what happens when a power loss happens.
convolvatron 20 hours ago [-]
if there is a hole in the log then the end of the log is before the hole. You do have to have checksums on log chunks (better yet, a kind of rolling hash), but you're really just talking about the number of entries that we would have liked to commit but didn't.
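A sketch of that recovery rule, with checksummed log chunks and a scan that stops at the first bad one (the record format here is invented for illustration):

```python
import struct
import zlib

# Each chunk: 4-byte length | 4-byte CRC32 of the payload | payload.
HEADER = struct.Struct("<II")

def append_chunk(log: bytearray, payload: bytes) -> None:
    log += HEADER.pack(len(payload), zlib.crc32(payload))
    log += payload

def recover(log: bytes) -> list:
    """Replay chunks in order; the durable log ends at the first bad chunk."""
    chunks, off = [], 0
    while off + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, off)
        payload = log[off + HEADER.size : off + HEADER.size + length]
        if length == 0 or len(payload) != length or zlib.crc32(payload) != crc:
            break  # zeroed (never-written) space, a hole, or a torn write
        chunks.append(payload)
        off += HEADER.size + length
    return chunks
```

Real logs also carry sequence numbers, so a stale chunk left over in a recycled file isn't mistaken for a fresh one.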
mightyham 20 hours ago [-]
Yeah this is a good point, and maybe a hole wasn't the right way to explain myself. The point is that the way a WAL is supposed to work is that the main data store always lags behind the WAL, so that if a partial operation (always idempotent) occurs on shutdown it is replayed on start up and fixed. In the case I describe, because of a lack of fsync it's possible for the WAL to lag the main data store, so partial operations will not be fixed on start up.
convolvatron 20 hours ago [-]
that's a much more interesting problem. fundamentally we're in a bad position by having two different formats, one optimized for writing and one for reading, that admit inconsistency between them. Postgres mitigates this slightly by having page level updates to the read indices also be present in the log (physiological), but that's always seemed like a huge waste to me.

if we give ourselves two definitions of persisted - logically (WAL, or write) and physically (index, or read) - it seems like we can maintain the invariant that P < L: (1) by keeping an in-memory view of the P-L delta that we have to consult on every read, and (2) an expensive but asynchronous flush path for updating P, driven by reads verifying that L has landed. Then have we patched all the holes(?)

edit: of course one of the root problems here is the drive lying, so how can we know that some log block has actually committed, so that we can update P

sethev 21 hours ago [-]
This seems sketchy. O_DIRECT skips the operating system's page cache; it does not guarantee that the SSD driver sent the data to the SSD or issued a flush to the drive itself. The data could still be in the driver's memory, or in non-durable memory in the drive itself, when this engine says "ok, we're good".

EDIT: sketchy from an answering "what exactly are the guarantees?" perspective

jandrewrogers 21 hours ago [-]
The model here is that the storage device is directly reading and writing the userspace buffer via DMA. It is one of the reasons use of O_DIRECT creates additional constraints on buffer alignment and size.

Some storage devices guarantee durability of non-persisted writes, which is explicitly part of their model. Consequently, the entire durable write path is the storage device completing a DMA read of their buffer.

The underlying assumptions will not hold true for every environment. However, it will hold true for many and you can check most (all?) of them at runtime.

sethev 18 hours ago [-]
Right - I mean, what you're describing makes sense, but it doesn't sound like what they're describing. Their benchmarks are running on an EC2 instance and the post's author is here saying that they run on virtualized hardware. Plus they run on top of a file system. None of that screams "direct DMA from our buffers" to me.

I'm not saying it's impossible, but typically people who want to lean on hardware guarantees for extra performance control more of the stack.

myself248 23 hours ago [-]
To step back a bit, the device still has a filesystem on it, and the structures described here are files within the filesystem? Just you're able to write directly into them, bypassing the filesystem layer, because you've constrained yourself to writes that don't require updating other parts of the filesystem structure?
thomas_fa 21 hours ago [-]
Yes, that's right. We could go even further and use raw devices without relying on any filesystem. We would then need to allocate/format raw disk space ourselves, and could no longer open files as simply as we do now. It would take some extra effort, but we would like to explore that in the future.

It would also make system initialization faster, since right now we need to write all zeros to make ext4/xfs actually initialize the extents as "allocated".

zzsheng 3 days ago [-]
Author here. This is not a general argument against fsync; the design depends on SSD-only deployment, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.
atombender 4 hours ago [-]
I'm surprised none of the design decisions considered an indirection between the folder tree structure and the actual files.

For example, if you map folders like /foo and /foo/bar to numeric IDs, then each file can simply refer their parent folder. Renaming a folder, or moving a folder to a new parent, does not need to update any files.

You can take this a step further and have a three-level split: Tree, file-tree join table, and files. The tree describes the hierarchical structure of folders (which changes more rarely than files do), while the file-tree join table is essentially [folder_id, file_id]. When a file is moved, only the join table (which is much smaller than the files and super sortable and compressible) must be updated.
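A toy sketch of that indirection (names and structure are illustrative, not from the article):

```python
class Namespace:
    """Folders keyed by numeric ID; files reference folders via a join table."""
    def __init__(self):
        self.folders = {0: (None, "")}   # id -> (parent_id, name); 0 is root
        self.links = {}                  # file_id -> folder_id (join table)
        self._next = 1

    def mkdir(self, parent_id: int, name: str) -> int:
        fid = self._next
        self._next += 1
        self.folders[fid] = (parent_id, name)
        return fid

    def add_file(self, file_id: str, folder_id: int) -> None:
        self.links[file_id] = folder_id          # one small row per file

    def rename_folder(self, folder_id: int, new_name: str) -> None:
        parent, _ = self.folders[folder_id]
        self.folders[folder_id] = (parent, new_name)  # no file rows touched

    def path_of(self, file_id: str) -> str:
        parts, fid = [], self.links[file_id]
        while fid != 0:
            parent, name = self.folders[fid]
            parts.append(name)
            fid = parent
        return "/" + "/".join(reversed(parts)) + "/" + file_id
```

Renaming or re-parenting a folder updates a single entry, no matter how many files sit beneath it; the price is that listing a path now requires walking the indirection.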

I take the point that updating multiple discrete pieces of information puts more demand on the transactional layer, which has to ensure atomicity and consistency. But I'm surprised it wasn't even mentioned as an alternative that was evaluated and rejected. The article starts out with the premise that a flat key/value approach is the only choice on the table.

100ms 22 hours ago [-]
Your approach looks interesting, but I was curious: when you talk about path-based splitting for ART, do you literally mean always on "/"? I know S3 directory buckets always use /, but the classic S3 model had no natural separator character, and I was wondering whether supporting those styles of prefix or custom-delimiter queries suffers any impediment in your approach.

Bookmarked your whole blog for later consumption, interesting stuff!

thomas_fa 21 hours ago [-]
Thanks for the encouragement! Another author here. Yes, if you are interested you can check another of our blog posts [1] on the internal storage engine. We are limiting the delimiter to "/" to better support POSIX FS semantics. I have just finished the fs feature branch, which has passed all of the POSIX fstests [2].

[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...

[2] https://github.com/pjd/pjdfstest

seastarer 21 hours ago [-]
It's more correct to use O_DSYNC in addition to O_DIRECT. This adds FUA to the disk write if the disk requires it for durability.
thomas_fa 21 hours ago [-]
Yes, that has also been pointed out in other threads. This could be a very important setting: even some common Linux filesystems don't actually do that every time, and we needed to disable the disk write cache during boot to make sure data was truly persistent (at my previous storage company).
seebeen 20 hours ago [-]
So instead of saying "We removed fsync" you should say: "We redesigned the database write path to avoid paying the full fsync durability cost on every write"
uroni 19 hours ago [-]
In my similar project (S3-compatible single-node storage) https://github.com/uroni/hs5 I do use proper fsync for data and metadata durability, but it can be turned off via a switch. It is a pet peeve of mine that the default should always be to fsync. I have a section on this in the project's README.

I also have an optional WAL. Maybe I should add an additional mode that disables fsync only for the WAL, though I don't think it would be a good idea. My WAL does use checksums and sequence numbers etc. to prevent committing wrong data.

ovaistariq 9 hours ago [-]
There is no way to reliably prove that bytes have made their way to the disk without issuing fsync. Thus, without it you cannot guarantee that writes ACKed to the client survive any failure afterwards
loeg 21 hours ago [-]
This design ACKs writes that aren't yet durably persisted (to the journal or data areas). That might be ok, but it might not. It's certainly unusual not to at least persist the journal update.
zzsheng 14 hours ago [-]
Nope. We will not ack any write that is not in the data or the journal. Please check the put details in the blog.
loeg 13 hours ago [-]
You initiate a write to the journal, but do not sync it before ACKing to the client.
zzsheng 13 hours ago [-]
The journal file is pre-allocated, and we use direct I/O for journal writes, so there is no need to call fsync.
loeg 12 hours ago [-]
Again, it is not durably persisted before acking to the client. Like I said earlier, that might be fine for your durability model, but it is unusual.
thomas_fa 10 hours ago [-]
We wait for the Bss data and journal direct I/O to complete, and do the acking (sending the response back to api_server) in the callback function. What you are implying is what S3 actually does - you can see it in their paper [1] - and we are stronger than that.

[1]https://www.amazon.science/publications/using-lightweight-fo....

alexhnn 3 days ago [-]
Working with files is hard [1], and most of the complexity comes from the fsync API. I am glad it can be eliminated from a kv storage engine.

[1] https://news.ycombinator.com/item?id=42805425

seebeen 21 hours ago [-]
So basically, you are writing data without guarantees it's actually written? "YOLO mode" but for data written to a device?

Would you be so kind to explain what happens in a power-loss scenario?

jnwatson 15 hours ago [-]
If you're bypassing the page cache, what invalidates the page cache so that the next read (from the filesystem) isn't stale?
zzsheng 14 hours ago [-]
we also use direct-io for reads.
bawolff 22 hours ago [-]
Am i understanding correctly that you are just targeting consistency and not durability?
zzsheng 14 hours ago [-]
actually both crash consistency and durability. after we ack, we make sure data will be lost due to crash, restart or power loss.
zzsheng 2 hours ago [-]
sorry, typo. data will *not* be lost.
dboreham 24 hours ago [-]
Almost full-circle back to when Oracle took over the entire volume and implemented its own filesystem.
dale_glass 23 hours ago [-]
I wonder why this is not more common. LVM is easy to set up, and it's already common to allocate volumes for things like disk images for VMs, so why not databases?
jandrewrogers 21 hours ago [-]
Some Linux filesystems, notably ext4 and XFS, provide the necessary features to get 90% of the benefit simply by using O_DIRECT correctly. The last 10% is achieved by doing direct I/O to raw block devices, with the obvious caveat that this is not as easy to manage.

Both of these are commonly done in database storage engines.

tptacek 22 hours ago [-]
If you preallocate and O_DIRECT, haven't you basically soaked up most of the benefit of skipping the filesystem?
pizza234 23 hours ago [-]
Because the speed increase is - on modern, properly tuned filesystems - surprisingly small, due to how RDBMSs manage their pool; by working on large container files, they avoid most of the filesystem overhead.
up2isomorphism 13 hours ago [-]
S3 was never designed for performance. Trying to stay compatible while pursuing very hardware-dependent, low-level optimization seems like the wrong direction to begin with.
zzsheng 13 hours ago [-]
check s3 express one zone
up2isomorphism 13 hours ago [-]
The repo seems to contain some API gateway code, and none of the actual storage engine is open sourced. I checked so you don't have to waste your time finding out.
7e 22 hours ago [-]
This is really great work. Kudos to the team for such an elegant solution.
thomas_fa 21 hours ago [-]
Thanks for the kind words! You can check out more of our work at https://github.com/fractalbits-labs/fractalbits.