Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #11552

Open
d-ph opened this issue Jul 16, 2024 · 9 comments

Comments

@d-ph
Copy link

d-ph commented Jul 16, 2024

Feature Request

Q A
New Feature no
RFC no
BC Break no

Summary

I'd like to propose that the Doctrine\ORM\Tools\Pagination\CountWalker should create a count query that selects COUNT(*) instead of COUNT(tbl.id) when a query's HINT_DISTINCT is set/declared false. Both "counts" result in the same number being produced, however COUNT(*) allows some databases (e.g. Postgres) to finish the query faster. Please see the following query plans.

// COUNT(tbl.id) case
EXPLAIN SELECT count(o0_.id) AS sclr_0 FROM "order" o0_;

QUERY PLAN
Finalize Aggregate  (cost=40598.68..40598.69 rows=1 width=8)
  ->  Gather  (cost=40598.46..40598.67 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=39598.46..39598.47 rows=1 width=8)
              ->  Parallel Seq Scan on ""order"" o0_  (cost=0.00..38594.17 rows=redacted width=4)

---

// COUNT(*) case
EXPLAIN SELECT count(*) AS sclr_0 FROM "order" o0_;

QUERY PLAN
Finalize Aggregate  (cost=36422.71..36422.72 rows=1 width=8)
  ->  Gather  (cost=36422.49..36422.70 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=35422.49..35422.50 rows=1 width=8)
              ->  Parallel Index Only Scan using idx_f5299398f543e763 on ""order"" o0_  (cost=0.42..34418.20 rows=redacted width=0)

Notice how in the case of COUNT(*), Postgres is counting using an "Index Only Scan". The speed difference isn't substantial (at least for the number of rows I tested with), but it's also quite cheap to obtain (by just making the code use COUNT(*)).

Thank you.

@greg0ire
Copy link
Member

Do you have an explanation as to why the query plan differs from one query to the other, and are you sure that explanation will apply regardless of the database in use? I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference. But I agree that "Seq Scan" does not sound good.

@d-ph
Copy link
Author

d-ph commented Jul 17, 2024

Do you have an explanation as to why the query plan differs from one query to the other

I've done some research just now and the short answer appears to be that COUNT(*) gives more opportunities to the database to seek alternative ways to fulfil the query than when COUNT(tbl.id) is used.

My other observations:

  1. A lot of sources state that COUNT(column_name) tells the database to count not-null values of the column_name, while COUNT(*) tells the database to just count the rows, regardless of their contents. Postgres in particular is sometimes willing to use an index-only scan to fulfil the COUNT(*) query. The apparent non-obvious thing here is that COUNT(*) is an equivalent of COUNT(), but it's not COUNT() because that's a syntax error in SQL.
  2. Despite the above, I found that both COUNT() queries allow Postgres to attempt to use an index-only scan, however COUNT(*) allows to use any index available on the table in question, while COUNT(tbl.id) allows to use only the (usually primary) index on the id column. This for some reason is sometimes less eagerly used if there's been a period of time since the last vacuum analyze was carried out on the table in question, because when I vacuum analyze'd my test table, then Postgres started using an index-only scan also for the COUNT(tbl.id) query.
  3. The time difference between a "parallel seq scan" and a "parallel index-only scan" seems to be under 20ms for my test tables (in favour of the index-only scan). It makes me believe that Postgres is just accurately concluding that there will be no difference between the two, and just chooses to sometimes use a sequential scan in case of the COUNT(tbl.id).

are you sure that explanation will apply regardless of the database in use?

I'm not sure but it's very plausible. For what it's worth, chatgpt is under impression (and bullet-points some examples) that most of the databases benefit from being told to do COUNT(*) instead of COUNT(column_name), when the column_name is not nullable (and primary keys aren't supposed to be nullable).

I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference.

After going down the rabbit hole here, I'm also less inclined now to make this change. The only standing argument for that change is what I said at the top: COUNT(*) gives more opportunities for optimised querying than COUNT(tbl.id), however, as I later said, I wasn't able to confirm spectacular speed difference in my test data.

For this reason I'm fine to close this ticket now without any code change, unless you'd like (me) to make that change anyway due to the reason I explained above.

@greg0ire
Copy link
Member

Thanks for the very detailed explanation. From what you are telling, I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (I'm leaving aside the case of composite primary keys for the sake of simplicity). Is that what you observe ?

@d-ph
Copy link
Author

d-ph commented Jul 18, 2024

I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (...) Is that what you observe?

Interesting. You're absolutely right. When I add a simple where-condition on a column that is indexed, the index-only scan is never used in the case of COUNT(tbl.id) (regardless of whether the table is vacuum'd or not), but is always used when COUNT(*) is used. In this particular case, I can see 50-75ms speed difference in favour of COUNT(*). Whether that's significant is open to debate.

I'm leaving aside the case of composite primary keys for the sake of simplicity

The CountWalker already throws an exception when the table being queried uses a composite primary key. So we don't need to discuss that case here.


So yeah, still down to you if you wish to carry on with this code change. I'm somewhat more inclined to it now.


Edit: I just tried different where-conditions, and as one could imagine, if the where-condition filters out significant amount of the table, the speed difference is significant. E.g. 0.1s vs 0.001s.

@greg0ire
Copy link
Member

Great! Using Git, can you find out why we are not using (*) in the first place? Likewise, if you switch the code to (*)< do any tests break?

@d-ph
Copy link
Author

d-ph commented Jul 19, 2024

why we are not using (*) in the first place?

Git history shows that the code does what it does since the very beginning (2012) and it wasn't attempted to change it to conditionally doing COUNT(*).

if you switch the code to (*)< do any tests break?

I made that change and no tests failed. In particular, CountWalkerTest didn't fail because it never checks for the case when the HINT_DISTINCT is set to false.

I checked this code change in my webapp and it's effective (i.e. COUNT(*) is run conditionally), and I didn't experience any regressions.

@greg0ire
Copy link
Member

Then please send a PR against the next minor branch, I think I'm OK with this change 👍

@d-ph
Copy link
Author

d-ph commented Jul 19, 2024

image

Could you tell me whether you mean the 3.3.x or the 3.2.x branch, please?

It that's an option, I would backport it to 2.20.x and then "replay" the code change from 2.20.x to every branch above it.

@greg0ire
Copy link
Member

2.x is only open for bugfixes. This does not look like a bug at all. The next minor branch is 3.3.x (since the last tag is 3.2.1).

d-ph pushed a commit to d-ph/orm that referenced this issue Jul 22, 2024
…se (doctrine#11552)

This change makes CountWalker use COUNT(*) instead of
COUNT(tbl.id), when the user declared that their query
does not need to use (SELECT) DISTINCT, which is
commonly the case when there are no JOINs in the query,
or when the JOINs are only *ToOne.

Research showed that COUNT(*) allows databases to use
index(-only) scans more eagerly from any of the
indexed columns, especially when the query is using
a WHERE-condition that filters on an indexed column.
greg0ire added a commit that referenced this issue Aug 19, 2024
…nt-star-query-sometimes

Make CountWalker use COUNT(*) when $distinct is explicitly set to false (#11552)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants