NHacker Next
  • new
  • past
  • show
  • ask
  • show
  • jobs
  • submit
The only scalable delete in Postgres is DROP TABLE (planetscale.com)
crazygringo 1 days ago [-]
Only by a weird definition of "scalable". The first sentence says:

> Counterintuitively, large DELETEs add work to the database.

There is nothing counterintuitive about this. It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.

And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table. Since DROP TABLE is ~instantaneous, mainly involving table-level metadata.

DELETE scales just fine, in the sense that if you are constantly inserting and deleting individual rows, DELETE scales the same as INSERT.

Basic database functionality is designed around the assumption of lots of small transactions. Whenever you have to do something involving millions of rows at once, you generally need to investigate solutions that work well in "bulk". E.g. loading rows directly from a file rather than with SQL, adding indices only after the data has been loaded rather than before, disabling foreign key checks on large operations (if you know by design that the keys are valid)... and yes, taking advantage of DROP TABLE instead of DELETE. This doesn't mean small transactions aren't scalable, it just means bulk operations are qualitatively different and benefit from their own solutions. And DELETE is no different from INSERT in this regard.

setr 1 days ago [-]
> It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.

It takes far more work to delete/update than insert. My recent example is updating ~2TB of text data was about 40x slower than inserting 12TB (was trying to correct some large text truncation that occurred during migration into PG, ended up being faster to redo).

crazygringo 1 days ago [-]
> It takes far more work to delete/update than insert.

Updating rows of text data is going to be more work, because variable-length text can't be updated in-place. So in terms of allocating space, it's more like a delete plus an insert. That's not surprising. (An in-place update that doesn't touch indices is generally going to be faster than an insert, though.)

I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.

gopalv 1 days ago [-]
> So in terms of allocating space, it's more like a delete plus an insert.

Unless you're using zHeap, you have a narrow Heap-only-Tuples scenario where the indexes stay the same. TOAST kinda helps there, if the update is off the tuple area itself. The original zHeap docs have a lot of detail about why an UNDO log can help with long running transactions from the past etc.

That is a postgresql specific thing though. Mysql indexes were created with the idea of different storage engines in mind, so Mysql doesn't suffer from the index update ovehead on update/delete the same way.

Uber had a long blog post about switching to Mysql from Postgres for wide tables with hundreds of indexes. The HN entry is still there[1], but I can't read the original post now.

As a side note, I've used postgres partitions to the same effect to drop old data periodically - detach and then drop the partition instead of a direct DELETE (similar tricks in HBase existed).

[1] - https://news.ycombinator.com/item?id=10894047

stuartd 19 hours ago [-]
> The HN entry is still there, but I can't read the original post now.

The post on IA - https://web.archive.org/web/20160304013342/https://eng.uber....

pixl97 1 days ago [-]
Not directly database related, but when it comes to writing files on disks, deletes on SSDs can be rather expensive because of the delete block size vs a simple write.
consp 9 hours ago [-]
Doesn't the block simply get marked as deleted and only wiped on a write?
22 hours ago [-]
fidotron 1 days ago [-]
> I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.

Transactionally across related items with constraints it can explode fast.

If you've ever used FoundationDB this rapidly becomes the defining PITA due to the transaction size limits. Adding/inserting/updates are all far more predictably bounded.

tremon 1 days ago [-]
But in that case, you need to compare like-for-like with the situation where you need to insert all the prerequisite rows too. You can't just compare a delete cascade with a single insert where all the foreign keys are already satisfied.
fidotron 22 hours ago [-]
The whole problem with the delete cascade is you can't tell how big it will be until you have entered the transaction to do it. An insert you either know or it will fail and you can retry.
tremon 21 hours ago [-]
That's true, but now you have moved the goalposts. The original claim upthread was "it takes just as much work to delete a row as it takes to insert a row", not "it's hard to predict the performance of a delete with cascade effects". And the obvious rebuttal to that is that it's equally hard to provide an upper bound for the runtime of a single insert: an application cannot control the other processes running on the database, some of which may delay, interfere with or even invalidate your query and you must account for that. A delete operation is just as much "it might fail and you can retry" as an insert, or the database you're working with isn't ACID-compliant.
fidotron 21 hours ago [-]
> And the obvious rebuttal to that is that it's equally hard to provide an upper bound for the runtime of a single insert

This is precisely where you're going wrong. The insert is upper boundable in advance (you know the set of everything you might potentially have to insert), the delete isn't because you don't know what's in the db until you look.

I strongly recommend poking around with Foundation for this, because it becomes clear that this problem is the defining flaw with the way they tried to architect with layers, to the point they have a queuing system for processing large jobs of this type.

ElectricalUnion 20 hours ago [-]
> The insert is upper boundable in advance

A concurrent DML happening then suddenly your MERGE INTO WHEN NOT MATCHED INSERT/INSERT INTO SELECT is way larger that you thought? I thought "some workloads can suddenly be way larger that I expected" was supposed to be a thing in all non-trivial DML.

tremon 8 minutes ago [-]
You don't even need a complex query; even the simplest of insert statements can cause cascade side effects if you have temporal tables or materialized views (or, Codd forbid, ON INSERT triggers).
drob518 6 hours ago [-]
Yea, this seems as obvious to me as why it’s more efficient to reformat a drive partition than to delete every file that might be stored there. Or why it’s more efficient to free a whole memory arena than to free every single memory block allocated within it. If you know you’re throwing everything away, it’s more efficient to invoke a “throw it all away” action than to throw away each piece individually.
echoangle 1 days ago [-]
> And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large a table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table.

Dumb question but why does the optimizer not just do that in secret then? Seems like something that should be detectable with some heuristics.

crazygringo 1 days ago [-]
Because what do you do if rows are being inserted in the original table, while the new table is having rows copied over? You'll get missing rows.

You can only do the DROP TABLE trick if you know nothing else is writing to the table at the same time. You know if that's the case, according to your business logic. The database has no idea.

The DROP TABLE trick effectively bypasses all the normal guarantees of data consistency. This is why it's so fast. But you have to know that that's a safe thing to do for your data.

nostrademons 1 days ago [-]
There are ways the DB could recover the data consistency guarantees, eg. keeping a log of operations that came in while the table was being copied over and then applying the relevant ones afterwards.

The tricky part is that the latency characteristics of these operations would be pretty surprising and unintuitive. It has the same problems as virtual memory and mark/sweep GC; sometimes, depending on system state and things that other threads are doing, an unrelated operation might block for very long time periods and give you huge user-visible pauses. It's often better to force these expensive operations to be explicit so that the developer has to think through the latency & consistency implications and make the tradeoffs they want.

convolvatron 1 days ago [-]
except in this particular case, as long as you don't exhaust resources, mvcc kind of lets you get away with making a transactionally consistent copy under the covers without blocking anyone, since its a big-ol read.
sgarland 1 days ago [-]
I assume partly because that would be extremely surprising behavior, and depending on the RDBMS and version, could introduce unexpected stalls. For example, MySQL < 8.0.23 scans the entire buffer pool to clear pages that were dropped, which can take a long time on large instances. There is / was a similar issue with its adaptive hash index, which AFAIK wasn’t ever fixed, though AHI’s default being shifted to OFF in 8.4 is a workaround, in a very hacky way.
Retr0id 1 days ago [-]
Maintaining the expected observable behaviours would get complicated if queries (especially other updates) against the same table are happening concurrently.
1 days ago [-]
layer8 1 days ago [-]
Because dropping a table effectively requires an exclusive lock on the table during that whole operation, affecting parallel transactions.
mordae 1 days ago [-]
It drops dependents.
umpalumpaaa 16 hours ago [-]
Does this “Drop hack” work well with foreign keys, triggers and constraints?
Retr0id 1 days ago [-]
> if you are constantly inserting and deleting individual rows, DELETE scales the same as INSERT

Technically correct, but for a small table with a high churn rate, the performance characteristics may be surprising in that the "n" in most big-O calculations includes all inserts since the last VACUUM, not the actual number of resident rows.

coldtea 22 hours ago [-]
>There is nothing counterintuitive about this. It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it?

Because e.g. DROP also effectively deletes the rows but takes way way less work.

winterbloom 1 days ago [-]
how does that solution work if the table that is dropped has foreign key constraints?
crazygringo 1 days ago [-]
That's why I said:

> disabling foreign key checks on large operations

And you have to know that, according to your business logic, what you're doing is safe.

globular-toast 11 hours ago [-]
> It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it?

Because your data structure/algorithm supports fast deletes? File systems support deleting entire directories instantly. I'm not aware of any fundamental reason why DELETE in a SQL database must take as long as an insert?

silversmith 6 hours ago [-]
I usually don't like the "read the article" replies, but in this case it's warranted - the second section explains how multiple transactions interact there, and the "deleted" record still needs to actually exist for older transactions.
dpark 5 hours ago [-]
> File systems support deleting entire directories instantly.

What file system supports this? I’ve never experienced the joy of deleting an entire large directory and seeing it disappear instantly.

I’m not sure how this would work with any modern file system that supports basic notions like hard links.

crazygringo 5 hours ago [-]
I suggest you read the following sentence of my comment, the one after your quote. It provides your answer.

And file systems deleting huge directories instantly is the equivalent of DROP TABLE here, which I also mention in my comment.

sgarland 1 days ago [-]
The same is true to a lesser extent in MySQL / MariaDB. It does better since it doesn’t do oldest-to-newest tuple chains, but it’s still adding non-trivial work to the DB, much of which is effectively wasted if you don’t care about the visibility of the deleted (or soon-to-be deleted) tuples to other transactions.

I sincerely hope that Planetscale’s efforts succeed long-term to shift devs’ understanding and acceptance of RDBMS operations. Their blog posts and docs are generally quite good. IME, devs (and even ops-ish teams) simply do not care about all of this, and will create elaborate bespoke tooling to run DELETEs in bulk, because they either don’t understand the capabilities of the database, or don’t want to deal with the [minor] increased complexity that a partitioned schema brings, and will happily pay the extra cost / latency for deletions.

kro 24 hours ago [-]
mysql/maria also lets you turn off/down the isolation level for queries if you know the guarantees aren't needed, to speed things up. I think postgres does not have that option.
alternbet25676 21 hours ago [-]
Postgres does support changing the isolation level at query, session and config: https://www.postgresql.org/docs/current/sql-set-transaction....
1 days ago [-]
saisrirampur 1 days ago [-]
Partially true but too much of a blanket statement and clickbaity.

DELETE with well-tuned autovacuum works pretty well. Have seen it work at TBs scale with no hicuups. If DELETEs are large, we used to recommend customers to follow that with a manual VACUUM for table to reclaim space right away for future rows.

DROP TABLE can be risky, it requires an ACCESS EXCLUSIVE LOCK and if its waiting, it blocks all other statements following it, because of how lock queues work in Postgres. And you cannot keep doing high concurrent DROP TABLEs to run your large scale CRUD app.

CharlieDigital 1 days ago [-]

    > And you cannot keep doing high concurrent DROP TABLEs to run your large scale CRUD app
In this kind of use case/design, I would assume it would make use of partitions to make this more palatable in which case it would seem that you would bypass this issue of "high concurrent DROP TABLE". Large scale CRUD app just points to recent-ish partitions. Old partitions are either going to be low or on access and can be dropped easily or transformed/transferred into some long term/cold storage.
saisrirampur 1 days ago [-]
(Most) CRUD/OLTP applications don't delete data by timestamp; they delete by primary key. For those workloads, DROP TABLE (or dropping a partition) isn't a palatable option.

The entire premise here is really about time-series workloads where most operations are based on a timestamp. In those apps partition dropping has been a standard and recommended retention strategy for years. That's precisely why extensions like pg_partman and TimescaleDB exist. Given that context, the title feels more clickbaity, and could easily mislead readers into thinking this applies broadly to OLTP systems when it doesn't;

parthdesai 6 hours ago [-]
> (Most) CRUD/OLTP applications don't delete data by timestamp; they delete by primary key. For those workloads, DROP TABLE (or dropping a partition) isn't a palatable option.

UUID v7 to the rescue!

saisrirampur 1 days ago [-]
Separately, this is one of the Postgres autovacuum tuning blog that I've ever read. Have seen it work across many customers and it is also simple to decipher and implement. https://www.citusdata.com/blog/2022/07/28/debugging-postgres...
jandrewrogers 1 days ago [-]
This generalizes to most (all?) databases. Selective deletion is largely an unsolved problem at scale in databases to the extent it doesn't release the deleted resources. Under the hood databases try to turn this into selective resource truncation, which scales much better, but in most cases that is not possible without careful design of your data model.

Similarly, you often have to remind devs that in many databases an UPDATE is just an INSERT + DELETE, with all of the scaling issues implied.

mike_hearn 9 hours ago [-]
RocksDB and other LSM tree backed databases do have cheap deletes and updates, although you could argue that's because they make everything else expensive. If you have spare cores it can be a good trade though.
cowthulhu 1 days ago [-]
IMO, needing to clear out an entire table is an indicator that something has gone wrong with your design.

Don't get me wrong, I've definitely done it before, but it's in the same bucket as VACUUM for me... high impact interventions used to fix a mistake I made, not "course of business" actions.

baq 1 days ago [-]
You should run vacuum as often as possible in Postgres if you’re doing anything other than INSERTs, this is a design tradeoff in Postgres itself. It’s the reason autovacuum exists and why tuning it is so important for performance; nothing wrong with doing a VACUUM ANALYZE after finishing a large DML batch job.
TurdF3rguson 16 hours ago [-]
what does as often as possible mean? It will auto-vacuum when it's idle, right? Why not just let it do that?
baq 11 hours ago [-]
What do you mean by ‘idle’?

If you mean your database is seeing extended periods of no updates to a table, you still want to vacuum, maybe even vacuum full if you know when traffic stopped and for how long to get the best possible read performance.

If you have quiet periods in both reads and writes, enjoy the luxury of having an unused database to operate.

saltcured 24 hours ago [-]
DROP DATABASE, for when a bunch of calls to DROP TABLE seems like too much overhead...
mike_hock 14 hours ago [-]
pg_dropcluster for when a bunch of calls to DROP DATABASE seems like too much overhead.
levkk 1 days ago [-]
CRUD apps don't usually delete in bulk. It's also hard to structure partitions in a way that doesn't wipe out months of important business data -- this is why teams often ETL their DB into Snowflake/ClickHouse and only then drop partitions. That makes it hard for the app to use that data again.

The better approach is either to change your storage engine (e.g. OrioleDB is working on adding the undo log to Pg), or to shard which distributes the vacuum load across multiple servers.

sgarland 1 days ago [-]
They should be performing bulk deletions, due to GDPR: “Data must be stored for the shortest time possible.” Unless you have some kind of rolling cron checking every few minutes (and even then, depending on your scale, that may well be considered bulk), that generally resolves to something like daily or weekly deletions.
buremba 18 hours ago [-]
Materialized tables are useful for time-series or sharding-like use-cases. You essentially offload the work to INSERT time to locate the data into relevant buckets/sub-tables that you can DROP later.

We use materialized views for append-only timeseries data for https://lobu.ai and the retention policies define how we DROP the tables so we don't DELETE/UPDATE any rows in the tables.

The long term storage is Iceberg on S3 that's ingested via Postgresql replication, suitable for OLAP use-cases. Postgresql only stores the dimensional OLTP data the users can update and the hot append-only event data.

insumanth 13 hours ago [-]
This mostly applies to almost any database.

Deletes are Writes and Writes are resource intensive. This is more prominent on databases like Elastic Search.

When I was tasked to delete millions of (old) documents, it overloaded the cluster and almost brought it down. Only scalable solution was to split the index and drop the whole index.

ankitml 19 hours ago [-]
This is directionally correct approach. Deleting a large chunk of rows, in a large table does lead to unpredictable-bad behaviour for a while until those dead tuples are handled.

I have used a very similar strategy by forking repack client https://github.com/reorg/pg_repack/pull/326 This works out of the box with rds/cloudsql etc.

elnerd 13 hours ago [-]
I’m using Postgres for my DNS log service. I only store data for 90 days. To delete data, my strategy is to use partitions based on month. At the start of every month, I drop one partition.

I am not sure of this is the best way to do this, but it works for me.

smithcheck4 1 days ago [-]
I have been using TRUNCATE allm the way, and is fast as very good.
flossly 21 hours ago [-]
Years ago I heard Oracle's db had an edge on PG when it comes to DELETEs. I guess that's still the case...
mike_hearn 9 hours ago [-]
Oracle has ALTER TABLE [name] MOVE ONLINE INCLUDING ROWS [predicates]. This is basically what you want - it keeps only the data matching the predicate, it's efficient and it works without disrupting the app. That might be the edge you heard about.
asgeirn 8 hours ago [-]
Also ALTER TABLE DROP PARTITION
foreigner 1 days ago [-]
Surprisingly to remove small numbers of rows in multiple tables (e.g. cleanup between automated tests), DELETE is often faster than TRUNCATE! It's counterintuitive but just measure it for yourself and see. Note you can DELETE from multiple tables in one statement using CTEs, and that way you don't need to think about foreign key dependency order.
inigyou 1 days ago [-]
This website appears to be hard blocking my IP address (drops all packets( so I can't read the article.
pstuart 1 days ago [-]
Yep, partitions are the way to go there.
awinter-py 1 days ago [-]
^ this

been exploring clickhouse and while it is definitely not a general purpose DB, for time-series shaped data that can survive some insert latency, the automatic partition-based TTL is very nice and, at least so far, requires zero attention to maintain

which I guess is solved by `pg_partman` at the bottom of the post

twotwotwo 1 days ago [-]
Years ago work was bit by the analogous thing in MySQL. Like it usually does, it took a chain of events:

- We wrote a cronjob to periodically DELETE for a retention policy on a table we'd just created. Most senior person on the team reviewed it, looked fine.

- Unusually for us, we prioritize QA'ing a different feature for release, delaying the release of this cronjob and a bunch of other code.

- During that delay, the new table accumulated many times more rows to be deleted than we'd expected during review.

- Release happens. All looks well since the initial delete wasn't a migration and cronjob hasn't run yet; engineer doing the release signs off.

- Cronjob runs, deleting hundreds of millions of rows quickly.

- Next day, replica lag's high and MySQL's transaction history is very high. MySQL keeps transaction history around until purge threads have visited all the affected pages on disk.

- The bad cluster conditions last for days and lead to other problems.

This omits detail and the 'noise' of everything else we were watching. But it gets across how the code and MySQL behaved.

Like most exciting events, it led to multiple changes to avoid a repeat. For retention policies, our new approach was one at the end of PlanetScale's post, to partition and drop old partitions. Transitioning to this from a huge unpartitioned table can be fun!

If a table is append-only and already huge, with lots of rows already past the retention threshold, you might only copy the rows to be kept to the new partitioned table: copy what you can, lock tables, do a last catch-up copy and swap tables. (Roughly the blog's 'performant one-off delete'.)

If the table's merely kind of big, gh-ost or such could allow you to ALTER without causing lag, locking, etc.

At a scale below that, you could run a slow incremental 'nibble' delete while watching server stats, and a step below that, plain ALTERs or DELETEs are fine.

Using partitioning has fun bits, too. In MySQL, the partition key has to be part of any unique index, understandably. But you have to keep that in mind when you're using INSERT..ON DUPLICATE KEY UPDATE and relying on uniqueness to trigger the update. Things stay interesting!

I hear Vitess shops like PlanetScale usually don't run multi-terabyte myqsld instances in the first place: even when physical nodes are big, they run many smaller mysqlds on them. That wouldn't make all this fully irrelevant--huge deletes would still sometimes be worse than copy-swap-drop--but it does seem real handy for taming issues that tend to worsen with mysqld size, like replication lag. All to say, little bit jelly of their setup over there!

DarkUranium 8 hours ago [-]
One thing I did a while ago was to make deletes part of inserts, to amortize the cost.

The main reason was to avoid a separate cron job, but it had other benefits (and downsides) too. Something like:

    DELETE FROM foo WHERE expires_at < now() LIMIT 10;
    INSERT INTO foo .....;
Note the LIMIT: it ensures the latency stays under control even if we've suddenly hit 50k rows that need deleting.

And by deleting (up to) 10 each time we insert one, it ensures obsolete things will eventually get deleted.

Obviously, this isn't viable when the deletions must happen due to strict policy (e.g. legal compliance) since it can't ensure when things get deleted, just that they eventually do. IIRC, in my case, I used it for a password reset tokens table. There's no legal issue there and keeping expired ones around is fine as long as the code also checks `expires_at` to make sure it's still valid (which would be a good practice regardless, for defense in depth).

silvestrov 6 hours ago [-]
Problem is that PostgreSQL does not support LIMIT on the DELETE command.

I have no idea why, it seems such an obvious feature for supporting large databases.

znnajdla 14 hours ago [-]
Why are databases so hard?
DarkUranium 8 hours ago [-]
Because of the guarantees they provide.

Storing some data in a binary file isn't very hard. Making it so that you can do quick lookups on it (indexes) and implementing joins in a sane way is kinda hard, but easy compared to the real problem:

Ensuring ACID (in the case of "traditional" databases). I.e. Atomicity, Consistency, Isolation, Durability.

You need to protect against data corruption in the event of failure, all while guaranteeing atomic operations at the user level concurrently (in most production DBs; SQLite is a notable exception in that it fully serializes writes --- but it can get away with this because it's an embedded database with the primary use case of a single-process writer). And the entire thing must land on a known good state at the end of all of those concurrent transactions.

... and they must do it all while maintaining good performance, and sometimes on a combination of filesystem + hardware that's actively hostile towards the idea of data integrity (e.g. hidden RAM caches in disk or RAID controllers that don't flush on power loss --- thankfully, those are getting rarer, or so I've come to understand).

xenator 1 days ago [-]
Deleting whole file is faster then deleting rows in file.
24 hours ago [-]
nicman23 6 hours ago [-]
funnily enough i had a broken mysql / mariadb install that drop would not work at all - pinned a core to 100% - but delete would be instant in 1M rows lmao
Syzygies 17 hours ago [-]
I can't believe I'm the first to Rick roll this thread with the most famous XKCD comic of all time:

https://xkcd.com/327/

TurdF3rguson 16 hours ago [-]
I can because this thread is not about sql injection.
missingdays 8 hours ago [-]
But there's DROP TABLE in both the title and the comic, so it must be relevant
rfsck4 12 hours ago [-]
[dead]
Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact
Rendered at 19:13:49 GMT+0000 (Coordinated Universal Time) with Vercel.