Postgres Count Performance

rowanseymour · on Oct 18, 2016

Even if counts could be made faster, at scale you'd probably still want to avoid counting anything that can be pre-calculated.

We use something similar to the trigger-based method they describe, tho have found that a lot of updates to count table inevitably ends with deadlocks. So instead of updating a count value, we always insert a new count of 1 or -1, and use summing to calculate the total count as needed. A background task is responsible for continually squashing the count values.

pgaddict · on Oct 18, 2016

Right, I was going to point out that one of the main issues with the per-table tally (trigger-based) is limited concurrency, as all inserts/deletes have to modify the same row. Deadlocks (due to updating counters for different tables) are an extreme case of this. The solution (or rather a mitigation reducing the chance of a deadlocks) is having multiple counters for each table (say, 32), and updating only one of them (e.g. based on backed PID). The trouble is that this increases the size of row_counts table.

Another issue with the global tally is that the row_counts table bloats quite a bit, because the trigger is executed per-row and the update creates a copy of the row (so the next invocation has to walk all the previous MVCC copies, causing the long INSERT time). I wonder whether statement-level triggers might be used here, somehow (I don't think so).

_pctq · on Oct 18, 2016

Great solution, thanks for sharing.

At this point, I wonder: with a huge dataset, wouldn't you have better time leaving the count field out of postgres altogether and use something like redis with its INCR/DECR instructions instead? This would prevent having deadlocks as well.

EDIT: that is, if you don't need to use the count field in other queries.

rowanseymour · on Oct 18, 2016

We've found that Redis works great for counting some things, but when you need a count which is guaranteed to match what is in a db table, it's preferable to manage counts at a db level. Often we're making counts based on the state of a record, and it's easy for the application level view of things to end up stale.

mey · on Oct 18, 2016

The middle ground to this sounds like a skip-list, but I haven't seen that implemented inside a SQL system. Maybe a trigger on inserts to the count row to randomly insert the skip list count. Problem would be, this would essentially require ordering and a transaction, so maybe could be calculated offline in an async way in some way.

brightball · on Oct 18, 2016

Solid solution. I've done similar by just writing counts to Memcached and then periodically persisting them to the database. If I you don't have queries that actually need the counts the persist time can be stretched out to much longer windows too.

eveningcoffee · on Oct 18, 2016

This is very elegant solution.

pgaddict · on Oct 18, 2016

A few comments regarding the count_distinct extension (author here):

> Note that custom extensions written in C like count_distinct are not bound by the value of work_mem. The array constructed in this extension can exceed your memory expectations.

This is slightly inaccurate, as this is not specific to custom aggregates. Custom aggregates are estimated just like any other (built-in) aggregates - number of expected groups times memory per group. The trouble is that (a) for aggregates with variable-length per-group state, we don't have a good size estimate, and (b) if the planner decides to use HashAggregate, we're unable to do anything when reaching work_mem. But this has nothing to do with the aggregate being custom - array_agg() and string_agg() have the same issue, for example.

FWIW, the extension was written quite a long time ago - before the various sort optimizations made by Peter Geoghegan. I wonder whether that made count_distinct obsolete.

life_is_short · on Oct 18, 2016

> for aggregates with variable-length per-group state, we don't have a good size estimate

Can you say a bit more why?

pgaddict · on Oct 18, 2016

No one implemented a better solution ;-)

Jokes aside, it seems simple but is fairly tricky, as it depends both on input data and various other parameters. For example for array_agg() or string_agg() it might be estimated from number of entries / average length. For hll it also depends on the accuracy and expected number of values to track, etc. I was considering adding another method to the API, providing a better estimate, but never got to that.

So the current code simply assumes 1KB (IIRC) per group in those cases, or something like that.

Of course, the memory estimate also depends on the number of groups, but that's an orthogonal issue.

JoelJacobson · on Oct 18, 2016

No mention in the article of the new parallel query support in 9.6 that can speed up COUNT(*)?

https://www.postgresql.org/docs/current/static/parallel-plan...

xomateix · on Oct 18, 2016

We were doing some benchmarks recently on different databases. Most of the queries were heavy aggregations (>1billion of rows) and, to be honest, we were a bit disappointed with the new parallel query support in pg, we were expecting much better performance.

While doing the benchmarks, we could see that citus was always taking full advantage of all the cores in the cluster, while postgres parallelization was not.

Disclaimer: I'm not a db expert and I don't have any relationship with either citus or postgres.

orf · on Oct 18, 2016

> On the other hand count(1) takes an argument and PostgreSQL has to check at every row to see that ts argument, 1, is indeed still not NULL.

I expect that's the same for all function calls like this? Surely pg has a concept of constants and doesn't needlessly re-check parameters?

DonPellegrino · on Oct 18, 2016

It checks that the first column of each row is not null, that's what the '1' stands for.

_pctq · on Oct 18, 2016

Super in-depth analysis, thanks to the author for writing it.

Past the recommendation of counting based on an indexed column, I wonder if this should really be user's concern. This paragraph especially triggers a "this should be fixed upstream" feeling in me:

> A word of warning. When work_mem is high enough to hold the whole relation PostgreSQL will choose HashAggregate even when an index exists. Paradoxically, giving the database more memory resources can lead to a worse plan. You can force the index-only scan by setting SET enable_hashagg=false; but remember to set it true again afterward or other query plans will get messed up.

But worst case scenario, this article will be useful until this is fixed, so thanks again :)

barrkel · on Oct 18, 2016

If you think "needs upstream fix", then no database is suitable for use, because they all choose poor plans in many edge cases. This is why most databases have extensions that let you hint or force index use (not Postgres, however; a mistake with mitigation), or support parenthesising your joins to force an evaluation order (this is an indirect way of forcing index use or join order, a mitigation), etc.

Query planning is something where a poor choice can have serious performance ramifications, because n is usually much larger than in most programs. Analyzing the algorithmic complexity of a piece of SQL takes some experience and experimentation, and with different table stats the query planner may make different decisions. It can be worthwhile limiting the planner's discretion to get more predictable performance.

(I work on a product where many of the features can be expressed in terms of relational algebra. Often, both the best performing and quickest to write and test implementation logic is a bunch of SQL, and not the kind of CRUD that is easily wrapped with an ORM. What would make my life easier is a SQL linter that, given a model of costs, would prevent people writing queries that scale only linearly with specific table sizes. I've accumulated sufficient intuition that I could do this for MySQL at this point.)

_pctq · on Oct 18, 2016

Optimizing queries manually is fine by me, what troubled me was more the fact that it was recommended to change a configuration setting on the fly, which sounded like asking for problems regarding concurrent requests.

But actually, I've just made a test, and it appears changing this setting only impacts the current connection, so provided it's toggled back after the request, this should not be a problem.

pavlik_enemy · on Oct 18, 2016

Though I wonder how it would work with pg-bouncer.

pavlik_enemy · on Oct 18, 2016

You can disable certain scan types in Postgres and that's something I had to do when no matter what it reported incorrect stats.

honkhonkpants · on Oct 18, 2016

Problems of this kind are the reason I always wanted a dbms to offer a direct plan API instead of a sql parser. It's dumb to fiddle with the statement until you get the desired plan. You should be able to just call for that plan.

andrea_s · on Oct 18, 2016

It's a bit odd there's no mention of the PG columnar store in this article (https://www.citusdata.com/blog/2014/04/03/columnar-store-for...) - especially since it's from the same company.

It would be interesting to see how much the performances improve once you use cstore_fdw (especially since 1M records is quite small when talking about OLAP workloads).

disclaimer: I've never used cstore_fdw, but I have evaluated a number of columnar databases in the past.

ozgune · on Oct 18, 2016

(Ozgun from Citus Data)

We find that the primary motivation for using cstore is reducing disk I/O / storage footprint. cstore_fdw keeps a columnar layout on disk in compressed form and reads only relevant columns. For example, it's commonly used for data archival purposes.

That said, cstore_fdw doesn't yet make optimizations related to query planning and execution. We made experiments in that direction (https://news.ycombinator.com/item?id=8423825), but making those changes production ready is no small effort.

Since all benchmarks in this blog post are for in-memory data, I don't know how much they would benefit from cstore. If I have the time, I'll give it a try and update this comment with the results.

buremba · on Oct 18, 2016

I think cstore_fdw is not popular enough among Citus users. Only a few of their customers use it since it's not trivial to use cstore_fdw in real-time workloads. Given than its use-case is mainly analytics, it seems a bit odd though.

throwawayReply · on Oct 18, 2016

The counting seems really slow compared to MSSQL? I'm not familiar with the reference hardware used here, but just running a similar test on my desktop with sql server it can count distinct a million strings in under a second (with parallelisation, around 2.5s cpu time).

Or am I missing the fact that these benchmarks are run on a reference spec which is comparatively old?

sb8244 · on Oct 18, 2016

I'm not familiar with mssql, does it use mvcc? That is the reason PG counting is slow (really slow)

throwawayReply · on Oct 18, 2016

I'm not an expert, this is my layman's understands.

It has different isolation levels, some involving snapshots and some not. I think most concurrency issues are (by default) dealt with by locks, which start at row level and can escalate to page and table level (with significant slowdown seen when lock escalation happens in contentious places).

But that only has an effect if it's under write.

sbuttgereit · on Oct 18, 2016

This is historically correct, but I believe later versions of mssql server now default to their mvcc implementation. I think this switch was circa 2005; not 100% sure.

I managed an enterprise applications group that primarily used mssql for data in 2007. I can't recall which mvcc implementation they use. Our servers were MSSQL 2000 and I remember being a bit more than surprised when the DBAs told me the root of the performance problems we had were due to lock escalation; having come from the Oracle and PostgreSQL worlds, I was naive enough to have thought that lock escalation implementations like this were historical curiosities rather than something I'd actually run into... live and learn I guess.

nickpeterson · on Oct 18, 2016

I believe it still has to be chosen, and the MS term is 'Read Committed Snapshot'. Everything is a tradeoff, read committed snapshot makes your tempdb busier (since it is involved in versioning), and balloons up the default size of a row in a table. With RCS you end up adding 14 bytes of overhead to each row, and which otherwise wouldn't be there.

So imagine a table with a single integer column and 1 billion rows. Instead of the 9 byte per row overhead (7 bytes for row metadata and 2 bytes for the page offset), you instead have 23 bytes, plus the 4 bytes to hold the int32. Without RCS on, that row would only have a 9 byte overhead and the rowsize would be 13 bytes.

so 1billion * 27bytes = 25.14 GB 1billion * 13bytes = 12.11 GB

There are some other performance tradeoffs (walking row version values, pagesplits when updating records without rowversions after converting the database, et cetera).

Ultimately, MVCC is great for contention, but it stinks if you're trying to efficiently pack in data.

For an alternative perspective, I sometimes bemoan the size of tables in postgres because of the mandatory overhead and versioning.

sbuttgereit · on Oct 19, 2016

Agreed. Table bloat is no fun either. Though, I would say under the most common workloads, it's easier to manage table bloat than it is lock escalation... I can pick my battleground for table bloat whereas lock escalation is immediate.

Oracle's MVCC method doesn't have that problem... but then you get the imfamous ORA-01555, "Snapshot too Old" from time to time.

Oh well, no perfect worlds I suppose.

loopdoend · on Oct 18, 2016

loopdoend · on Oct 18, 2016