Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #11552

d-ph · 2024-07-16T15:32:51Z

Feature Request

Q	A
New Feature	no
RFC	no
BC Break	no

Summary

I'd like to propose that the Doctrine\ORM\Tools\Pagination\CountWalker should create a count query that selects COUNT(*) instead of COUNT(tbl.id) when a query's HINT_DISTINCT is set/declared false. Both "counts" result in the same number being produced, however COUNT(*) allows some databases (e.g. Postgres) to finish the query faster. Please see the following query plans.

// COUNT(tbl.id) case
EXPLAIN SELECT count(o0_.id) AS sclr_0 FROM "order" o0_;

QUERY PLAN
Finalize Aggregate  (cost=40598.68..40598.69 rows=1 width=8)
  ->  Gather  (cost=40598.46..40598.67 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=39598.46..39598.47 rows=1 width=8)
              ->  Parallel Seq Scan on ""order"" o0_  (cost=0.00..38594.17 rows=redacted width=4)

---

// COUNT(*) case
EXPLAIN SELECT count(*) AS sclr_0 FROM "order" o0_;

QUERY PLAN
Finalize Aggregate  (cost=36422.71..36422.72 rows=1 width=8)
  ->  Gather  (cost=36422.49..36422.70 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=35422.49..35422.50 rows=1 width=8)
              ->  Parallel Index Only Scan using idx_f5299398f543e763 on ""order"" o0_  (cost=0.42..34418.20 rows=redacted width=0)

Notice how in the case of COUNT(*), Postgres is counting using an "Index Only Scan". The speed difference isn't substantial (at least for the number of rows I tested with), but it's also quite cheap to obtain (by just making the code use COUNT(*)).

Thank you.

The text was updated successfully, but these errors were encountered:

greg0ire · 2024-07-16T16:01:31Z

Do you have an explanation as to why the query plan differs from one query to the other, and are you sure that explanation will apply regardless of the database in use? I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference. But I agree that "Seq Scan" does not sound good.

d-ph · 2024-07-17T10:27:41Z

Do you have an explanation as to why the query plan differs from one query to the other

I've done some research just now and the short answer appears to be that COUNT(*) gives more opportunities to the database to seek alternative ways to fulfil the query than when COUNT(tbl.id) is used.

My other observations:

A lot of sources state that COUNT(column_name) tells the database to count not-null values of the column_name, while COUNT(*) tells the database to just count the rows, regardless of their contents. Postgres in particular is sometimes willing to use an index-only scan to fulfil the COUNT(*) query. The apparent non-obvious thing here is that COUNT(*) is an equivalent of COUNT(), but it's not COUNT() because that's a syntax error in SQL.
Despite the above, I found that both COUNT() queries allow Postgres to attempt to use an index-only scan, however COUNT(*) allows to use any index available on the table in question, while COUNT(tbl.id) allows to use only the (usually primary) index on the id column. This for some reason is sometimes less eagerly used if there's been a period of time since the last vacuum analyze was carried out on the table in question, because when I vacuum analyze'd my test table, then Postgres started using an index-only scan also for the COUNT(tbl.id) query.
The time difference between a "parallel seq scan" and a "parallel index-only scan" seems to be under 20ms for my test tables (in favour of the index-only scan). It makes me believe that Postgres is just accurately concluding that there will be no difference between the two, and just chooses to sometimes use a sequential scan in case of the COUNT(tbl.id).

are you sure that explanation will apply regardless of the database in use?

I'm not sure but it's very plausible. For what it's worth, chatgpt is under impression (and bullet-points some examples) that most of the databases benefit from being told to do COUNT(*) instead of COUNT(column_name), when the column_name is not nullable (and primary keys aren't supposed to be nullable).

I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference.

After going down the rabbit hole here, I'm also less inclined now to make this change. The only standing argument for that change is what I said at the top: COUNT(*) gives more opportunities for optimised querying than COUNT(tbl.id), however, as I later said, I wasn't able to confirm spectacular speed difference in my test data.

For this reason I'm fine to close this ticket now without any code change, unless you'd like (me) to make that change anyway due to the reason I explained above.

greg0ire · 2024-07-17T18:35:18Z

Thanks for the very detailed explanation. From what you are telling, I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (I'm leaving aside the case of composite primary keys for the sake of simplicity). Is that what you observe ?

d-ph · 2024-07-18T15:53:18Z

I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (...) Is that what you observe?

Interesting. You're absolutely right. When I add a simple where-condition on a column that is indexed, the index-only scan is never used in the case of COUNT(tbl.id) (regardless of whether the table is vacuum'd or not), but is always used when COUNT(*) is used. In this particular case, I can see 50-75ms speed difference in favour of COUNT(*). Whether that's significant is open to debate.

I'm leaving aside the case of composite primary keys for the sake of simplicity

The CountWalker already throws an exception when the table being queried uses a composite primary key. So we don't need to discuss that case here.

So yeah, still down to you if you wish to carry on with this code change. I'm somewhat more inclined to it now.

Edit: I just tried different where-conditions, and as one could imagine, if the where-condition filters out significant amount of the table, the speed difference is significant. E.g. 0.1s vs 0.001s.

greg0ire · 2024-07-18T16:34:16Z

Great! Using Git, can you find out why we are not using (*) in the first place? Likewise, if you switch the code to (*)< do any tests break?

d-ph · 2024-07-19T13:26:20Z

why we are not using (*) in the first place?

Git history shows that the code does what it does since the very beginning (2012) and it wasn't attempted to change it to conditionally doing COUNT(*).

if you switch the code to (*)< do any tests break?

I made that change and no tests failed. In particular, CountWalkerTest didn't fail because it never checks for the case when the HINT_DISTINCT is set to false.

I checked this code change in my webapp and it's effective (i.e. COUNT(*) is run conditionally), and I didn't experience any regressions.

greg0ire · 2024-07-19T14:29:59Z

Then please send a PR against the next minor branch, I think I'm OK with this change 👍

d-ph · 2024-07-19T16:00:41Z

Could you tell me whether you mean the 3.3.x or the 3.2.x branch, please?

It that's an option, I would backport it to 2.20.x and then "replay" the code change from 2.20.x to every branch above it.

greg0ire · 2024-07-19T18:54:14Z

2.x is only open for bugfixes. This does not look like a bug at all. The next minor branch is 3.3.x (since the last tag is 3.2.1).

…se (doctrine#11552) This change makes CountWalker use COUNT(*) instead of COUNT(tbl.id), when the user declared that their query does not need to use (SELECT) DISTINCT, which is commonly the case when there are no JOINs in the query, or when the JOINs are only *ToOne. Research showed that COUNT(*) allows databases to use index(-only) scans more eagerly from any of the indexed columns, especially when the query is using a WHERE-condition that filters on an indexed column.

…nt-star-query-sometimes Make CountWalker use COUNT(*) when $distinct is explicitly set to false (#11552)

d-ph mentioned this issue Jul 22, 2024

Make CountWalker use COUNT(*) when $distinct is explicitly set to false (#11552) #11557

Merged

greg0ire added a commit that referenced this issue Aug 19, 2024

Merge pull request #11557 from d-ph/feature/make-count-walker-use-cou…

5f1fe15

…nt-star-query-sometimes Make CountWalker use COUNT(*) when $distinct is explicitly set to false (#11552)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #11552

Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #11552

d-ph commented Jul 16, 2024

greg0ire commented Jul 16, 2024

d-ph commented Jul 17, 2024

greg0ire commented Jul 17, 2024

d-ph commented Jul 18, 2024 •

edited

Loading

greg0ire commented Jul 18, 2024

d-ph commented Jul 19, 2024

greg0ire commented Jul 19, 2024

d-ph commented Jul 19, 2024

greg0ire commented Jul 19, 2024

Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #11552

Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #11552

Comments

d-ph commented Jul 16, 2024

Feature Request

Summary

greg0ire commented Jul 16, 2024

d-ph commented Jul 17, 2024

greg0ire commented Jul 17, 2024

d-ph commented Jul 18, 2024 • edited Loading

greg0ire commented Jul 18, 2024

d-ph commented Jul 19, 2024

greg0ire commented Jul 19, 2024

d-ph commented Jul 19, 2024

greg0ire commented Jul 19, 2024

d-ph commented Jul 18, 2024 •

edited

Loading