Subqueries give strange results

The following query

SELECT Count(*)                     AS po_count,
       Date_format(orderdate, '%b') AS ordermonth,
       Year(orderdate)              AS orderyear
FROM   erp.purchase_orders po
GROUP  BY Year(orderdate),
          Month(orderdate),
          Date_format(orderdate, '%b')
ORDER  BY Year(orderdate),
          Month(orderdate)  

gives the results I expect, a grouping of every month, year, and a count of records.

po_count	ordermonth	orderyear
1	Jan	1970
4	Dec	2003
4	Jun	2005
3	Sep	2005
2	Nov	2005
2	Mar	2006
3	Apr	2006
7	Jul	2006
2	Sep	2006
46	Oct	2006
3	Nov	2006
14	Dec	2006
6	Feb	2007
7	May	2007
17	Jun	2007
8	Dec	2007
13	Feb	2008

The application I am writing requires that this base query can be further filtered by an user who will just pass in a filter clause like

where orderyear = 2020

My plan was to warp the above query as a subquery and the append the filter like this

SELECT *
FROM   (SELECT Count(*)                     AS po_count,
               Date_format(orderdate, '%b') AS ordermonth,
               Year(orderdate)              AS orderyear
        FROM   erp.purchase_orders po
        GROUP  BY Year(orderdate),
                  Month(orderdate),
                  Date_format(orderdate, '%b')
        ORDER  BY Year(orderdate),
                  Month(orderdate)) _metric
WHERE  _metric.orderyear = 2020  

The results however only return back two (what seem to be random) records

po_count	ordermonth	orderyear
11378	Dec	2020
11998	Feb	2020

If I put the filter directly into the query I get the result I would expect.

SELECT Count(*)                     AS po_count,
       Date_format(orderdate, '%b') AS ordermonth,
       Year(orderdate)              AS orderyear
FROM   erp.purchase_orders po
WHERE  Year(orderdate) = 2020
GROUP  BY Year(orderdate),
          Month(orderdate),
          Date_format(orderdate, '%b')
ORDER  BY Year(orderdate),
          Month(orderdate)  
po_count	ordermonth	orderyear
12858	Jan	2020
11998	Feb	2020
14765	Mar	2020
7952	Apr	2020
12123	May	2020
15015	Jun	2020
11876	Jul	2020
11818	Aug	2020
14210	Sep	2020
11875	Oct	2020
13317	Nov	2020
11378	Dec	2020

So how can I use traditional subqueries in Trino?

Thanks

Hey @soxprox, it seems a little redundant to include Month(orderdate) in the GROUP BY statement when you have Date_format(orderdate, '%b') already filtering by month. I’m curious to know what happens if you remove that from your GROUP BY statement.

SELECT *
FROM   (SELECT Count(*)                     AS po_count,
               Date_format(orderdate, '%b') AS ordermonth,
               Year(orderdate)              AS orderyear
        FROM   erp.purchase_orders po
        GROUP  BY Year(orderdate),
                  Date_format(orderdate, '%b')
        ORDER  BY Year(orderdate),
                  Month(orderdate)) _metric
WHERE  _metric.orderyear = 2020

Not saying that is the solution because it should have worked either way, but I wonder if it has some effect on the behavior of the nested query.

That should just work. Can you share the result of EXPLAIN for the two queries?
Also, what version of Trino are you running?

@martin, Thanks for getting back to me, here are the EXPLAINS.

Trino Version 363With SubQuery

Fragment 0 [SINGLE]
    Output layout: [count, date_format$gid, year$gid]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Output[po_count, ordermonth, orderyear]
    │  Layout: [count:bigint, date_format$gid:varchar, year$gid:bigint]
    │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
    │  po_count := count
    │  ordermonth := date_format$gid
    │  orderyear := year$gid
    └─ RemoteSource[1]
           Layout: [year$gid:bigint, date_format$gid:varchar, count:bigint]
 
Fragment 1 [HASH]
    Output layout: [year$gid, date_format$gid, count]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Project[]
    │  Layout: [year$gid:bigint, date_format$gid:varchar, count:bigint]
    │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
    └─ Aggregate(FINAL)[year$gid, month$gid, date_format$gid][$hashvalue]
       │  Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar, $hashvalue:bigint, count:bigint]
       │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
       │  count := count("count_0")
       └─ LocalExchange[HASH][$hashvalue] ("year$gid", "month$gid", "date_format$gid")
          │  Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar, count_0:bigint, $hashvalue:bigint]
          │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
          └─ RemoteSource[2]
                 Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar, count_0:bigint, $hashvalue_1:bigint]
 
Fragment 2 [SOURCE]
    Output layout: [year$gid, month$gid, date_format$gid, count_0, $hashvalue_2]
    Output partitioning: HASH [year$gid, month$gid, date_format$gid][$hashvalue_2]
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Aggregate(PARTIAL)[year$gid, month$gid, date_format$gid][$hashvalue_2]
    │  Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar, $hashvalue_2:bigint, count_0:bigint]
    │  count_0 := count(*)
    └─ Project[]
       │  Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar, $hashvalue_2:bigint]
       │  Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
       │  $hashvalue_2 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("year$gid"), 0)), C
       └─ ScanFilterProject[table = erp.purchase_orders Erp.PURCHASE_ORDERS columns=[OrderDate
              Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar]
              Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ?
              year$gid := year("orderdate")
              month$gid := month("orderdate")
              date_format$gid := date_format("orderdate", '%b')
              orderdate := OrderDate:timestamp(0):datetime2 Without SubQuery

Fragment 0 [SINGLE]
    Output layout: [count, date_format$gid, year$gid]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Output[po_count, ordermonth, orderyear]
    │  Layout: [count:bigint, date_format$gid:varchar, year$gid:bigint]
    │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
    │  po_count := count
    │  ordermonth := date_format$gid
    │  orderyear := year$gid
    └─ Project[]
       │  Layout: [date_format$gid:varchar, year$gid:bigint, count:bigint]
       │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
       └─ RemoteMerge[1]
              Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, count:bigint]
 
Fragment 1 [ROUND_ROBIN]
    Output layout: [date_format$gid, year$gid, month$gid, count]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    LocalMerge[year$gid ASC NULLS LAST, month$gid ASC NULLS LAST]
    │  Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, count:bigint]
    │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
    └─ PartialSort[year$gid ASC NULLS LAST, month$gid ASC NULLS LAST]
       │  Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, count:bigint]
       └─ RemoteSource[2]
              Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, count:bigint]
 
Fragment 2 [HASH]
    Output layout: [date_format$gid, year$gid, month$gid, count]
    Output partitioning: ROUND_ROBIN []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Project[]
    │  Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, count:bigint]
    │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
    └─ Aggregate(FINAL)[date_format$gid, year$gid, month$gid][$hashvalue]
       │  Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, $hashvalue:bigint, count:bigint]
       │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
       │  count := count("count_0")
       └─ LocalExchange[HASH][$hashvalue] ("date_format$gid", "year$gid", "month$gid")
          │  Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, count_0:bigint, $hashvalue:bigint]
          │  Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
          └─ RemoteSource[3]
                 Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, count_0:bigint, $hashvalue_1:bigint]
 
Fragment 3 [SOURCE]
    Output layout: [date_format$gid, year$gid, month$gid, count_0, $hashvalue_2]
    Output partitioning: HASH [date_format$gid, year$gid, month$gid][$hashvalue_2]
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Aggregate(PARTIAL)[date_format$gid, year$gid, month$gid][$hashvalue_2]
    │  Layout: [date_format$gid:varchar, year$gid:bigint, month$gid:bigint, $hashvalue_2:bigint, count_0:bigint]
    │  count_0 := count(*)
    └─ Project[]
       │  Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar, $hashvalue_2:bigint]
       │  Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
       │  $hashvalue_2 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("date_format$gid"),
       └─ ScanFilterProject[table = erp.purchase_orders Erp.PURCHASE_ORDERS columns=[OrderDate
              Layout: [year$gid:bigint, month$gid:bigint, date_format$gid:varchar]
              Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ?
              year$gid := year("orderdate")
              month$gid := month("orderdate")
              date_format$gid := date_format("orderdate", '%b')
              orderdate := OrderDate:timestamp(0):datetime2

@martin, Further testing shows that each time I run the same query (using a subquery) I get a different number of results. Sometimes only a single result, sometimes more.

@martin, I think I have found the issue. We are using the Trino API to run queries. I think the logic is in the way we are processing the API responses. If run the subquery in the Trino CLI, it works as expected.

Thats good to know. It might be worth changing your setup to run the query with the Trino CLI, the JDBC driver or the Python client, just not directly against the API.