Skip to content

New Window Functions syntax. DistinctBy translation support. #4844

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

sdanyliv
Copy link
Member

@sdanyliv sdanyliv commented Feb 23, 2025

Implemented Features

  • Added support for DistinctBy.
  • Implemented several window functions for evaluating proposed syntax.

Current Implementation (preliminary, subject to changes, frame boundaries not yet implemented)

var query =
    from t in db.SomeTable
    let wnd = Sql.Window.DefineWindow(f => f.PartitionBy(t.Field1).OrderBy(t.Field2))
    select new
    {
        RN1 = Sql.Window.RowNumber(f => f.PartitionBy(t.Field1).OrderBy(t.Field2)),
        RN2 = Sql.Window.RowNumber(f => f.UseWindow(wnd)),

        Sum1 = Sql.Window.Sum(t.Field, f => f.PartitionBy(t.Field1).OrderBy(t.Field2)),
        Sum2 = Sql.Window.Sum(t.Field, f => f.UseWindow(wnd)),
        SumFiltered = Sql.Window.Sum(t.Field, f => f.Filter(t.Field > 10).PartitionBy(t.Field1).OrderBy(t.Field2))
    };

Aggregate Functions Examples

var query =
    from t in db.SomeTable
    group t by t.Category into g
    select new
    {
        PercentileCont = g.PercentileCont(0.5, (e, f) => f.OrderBy(e.Field1))
    };
var aggregated = db.SomeTable.PercentileCont(0.5, (e, f) => f.OrderBy(e.Field1));

Open Questions

Specifying Arguments

COUNT

SQL Syntax Proposed C# Syntax
COUNT() OVER(...) Sql.Window.Count(f => f.PartitionBy(x.Some))
COUNT(x.Some) OVER(...) Sql.Window.Count(x.Some, f => f.PartitionBy(x.Some))
COUNT(DISTINCT x.Some) OVER(...) Possible variations include:
1. Sql.Window.CountDistinct(x.Some, f => f.PartitionBy(x.Some))
2. Sql.Window.Count(f => f.DistinctBy(x.Some).PartitionBy(x.Some))
3. Sql.Window.Count(a => a.Distinct(x.Some), f => f.PartitionBy(x.Some))

SUM, AVG, etc.

SQL Syntax Proposed C# Syntax
SUM(x.Some) OVER(...) Sql.Window.Sum(x.Some, f => f.PartitionBy(x.Some))
SUM(DISTINCT x.Some) OVER(...) Possible variations include:
1. Sql.Window.SumDistinct(x.Some, f => f.PartitionBy(x.Some)) (Note: multiple overloads required)
2. Sql.Window.Sum(a => a.Distinct(x.Some), f => f.PartitionBy(x.Some))

Recommended syntax (preferred for clarity and extensibility):

Sql.Window.Count(x.Some, f => f.PartitionBy(x.Some))
Sql.Window.Count(a => a.Distinct(x.Some), f => f.PartitionBy(x.Some))

Sql.Window.Sum(x.Some, f => f.PartitionBy(x.Some))
Sql.Window.Sum(a => a.Distinct(x.Some), f => f.PartitionBy(x.Some))

Specifying Frames

SQL Frame Syntax:

[ROWS | RANGE | GROUPS] frame_start [frame_exclusion]
[ROWS | RANGE | GROUPS] BETWEEN frame_start AND frame_end [frame_exclusion]

frame_exclusion:
    EXCLUDE CURRENT ROW
    EXCLUDE GROUP
    EXCLUDE TIES
    EXCLUDE NO OTHERS

SQL Examples:

-- Rows Example
SUM(x.Some) OVER (
    ORDER BY x.Some
    ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
)

-- Range Example
SUM(x.Some) OVER (
    ORDER BY x.Some
    ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
)

Current Proposed C# Syntax:

Currently proposed (open to discussion):

Sql.Window.Sum(x.Some, f => f.OrderBy(x.Some).RowsBetween.Value(1).And.Current)
Sql.Window.Sum(x.Some, f => f.OrderBy(x.Some).RangeBetween.Value(1).And.Unbounded)

Alternate possibilities:

// Alternative simplified syntax
Sql.Window.Sum(x.Some, f => f.OrderBy(x.Some).RowsBetween.Value(1).Current)
Sql.Window.Sum(x.Some, f => f.OrderBy(x.Some).RangeBetween.Value(1).Unbounded)

// More verbose but explicit syntax
Sql.Window.Sum(x.Some, f => f.OrderBy(x.Some).RowsBetween(b => b.Value(1), b => b.Current))
Sql.Window.Sum(x.Some, f => f.OrderBy(x.Some).RangeBetween(b => b.Value(1), b => b.Current))

Seeking Feedback

Please share your thoughts on the proposed syntaxes and indicate your preference, especially considering clarity, consistency, and extensibility.

@jods4
Copy link
Contributor

jods4 commented Mar 3, 2025

@linq2db/developers As we're introducing new apis for this complex and often-used feature, I think this deserves a thorough discussion with the whole team to try and land the best DX possible. This is a complex API surface for sure, there could be many different ideas.

It's also good to design in a way where we're able to cover the complete SQL standard regarding windows, so I'll start by summarizing Postgre docs on the topic (as Postgre is probably the most standard-compliant DB that I know of).

First a quick note: window functions apply to aggregate-like functions, not aggregates.
They have a lot in common but the key differences IMHO is that aggregates have ALL | DISTINCT support and built-in ORDER BY (incl. WITHIN GROUP stuff).
Here's the Aggregates syntax from Postgre:

aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]

I think we should keep them in our mind but they could also have a different "entry" api?

Now here's the Postgre syntax for window functions. First, what's an aggregate-like function?

function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
  • It's a function with zero (* such as count(*)), one, or several (e.g. string_agg separator) parameters;
  • It can have a FILTER applied (so do aggregates).
  • It's applied OVER a window: either inline (window_definition) or that references a previously named window window_name.

The ability to reference a window_name has a big impact in that we need to be able to define a window without a function, and refer to it, but syntax-wise it's the same. It translates into a clause WINDOW window_name AS ( window_definition ) in the statement.

Here's the window_definition syntax (same inline or as WINDOW):

[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]
  • existing_window_name can be used to extend a named WINDOW definition, e.g. change it's ORDER BY, PARTITION BY or frame. I am not familiar with this, my understanding is that whatever you specify fully replaces the equivalent clause in the existing window. E.g. WINDOW w1 as (PARTITION BY department ORDER BY salary DESC), w2 as (w1 ORDER BY salary ASC).
  • PARTITION BY is just a list of expressions to use as partition keys
  • ORDER BY is our old friend. It'd be good to check if we can unify ORDER BYs from other places, to have consistent NULLS FIRST support for example. I'm not familiar with the USING operator clause, I assume you can define custom comparisons in Postgre?

Finally frame_clause indicates which siblings rows are part of the aggregate for each selected row:

{ RANGE | ROWS | GROUPS } frame_start [ frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end [ frame_exclusion ]
  • Three modes: a RANGE of actual values, a number of ROWS or a number of sibling GROUPS, where a group is defined by ORDER BY equivalence (so GROUPS can't be used without ORDER BY).
  • A start and optionnally an end.
  • A clause to exclude specific rows.

Frames are:

UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING

Note that offset is usually an integer, but when the frame is defined as RANGE it can be other types, e.g. an INTERVAL if the window is ordered on a DATE (e.g. RANGE '7 days' PRECEDING).

Exclusion are:

EXCLUDE CURRENT ROW
EXCLUDE GROUP
EXCLUDE TIES
EXCLUDE NO OTHERS

@jods4
Copy link
Contributor

jods4 commented Mar 3, 2025

@sdanyliv You'll have to let me know what is possible with the new parser or not.
Here I'm really thinking only as an end-user, what syntax I would ideally like best.

Here's the current PR syntax from a random test case:

Sql.Window.Sum(t.IntValue, w => w.PartitionBy(t.CategoryId).OrderBy(t.Id))

I am not a big fan of nesting the window definition inside the aggregate function, nor using lots of lambdas, esp. in a LINQ expression where I think they wouldn't be required?

Here's one idea: what if we started with the window and ended with the aggregate?
I know it doesn't follow exactly the SQL order, but as much as I like have linq2db look like a literal SQL DSL, sometimes a few changes just make it more usable from C#...

from t in table
// Can we use fluent api without lambdas to define the window?
let w = Sql.Window("optional_name").PartitionBy(t.Country)
select new
{
  // Can we end with aggregate?
  // This works naturally for existing windows:
  sumByCountry = w.Sum(t.Salary),
  // Also work to modify a named window?
  sumByAge = w.PartitionBy(t.Age).Sum(t.Salary),
  // Or define a new window
  localWindowSum = Sql.Window.PartitionBy(t.Country).Sum(t.Salary),
}

What do you guys think?

@igor-tkachev
Copy link
Member

An issue with the proposed Sql.Window() is that it conflicts with the existing Sql.Window property.
Instead of Sql.Window(), we could use something like Sql.Window.Define(), Sql.Window.Create(), or Sql.Window.New().

@sdanyliv
Copy link
Member Author

sdanyliv commented Mar 4, 2025

@jods4, try applying your syntax to small test classes first. The issue here is that it doesn’t guarantee a correct window function definition. Some functions require a partition, while others require OrderBy. For instance, PercentileCont mandates a single OrderBy, and the argument of OrderBy determines the function’s return type.

I’m also not a big fan of using a nested lambda inside a window function, but in this case, it’s the most capable approach since the builder is strictly constrained to a specific window function.

Keep in mind that lambda syntax greatly simplifies translation. Once we identify the RowNumber function, everything is encapsulated within its definition.

Regarding window definition, I might be mistaken, but I don’t see much benefit in defining a window the way you suggested. Check the RowNumberWithMultiplePartitionsWithDefineWindow test—it demonstrates how I define such a window. There’s no name assigned yet, and it may not even be necessary. In fact, we could automatically detect duplicate window definitions and incorporate them into the query if it significantly improves SQL performance.

@jods4
Copy link
Contributor

jods4 commented Mar 4, 2025

@igor-tkachev

An issue with the proposed Sql.Window() is that it conflicts with the existing Sql.Window property.

Wasn't it introduced for this release? I just looked in 5.1.1 completion and I don't see it.
Anyway we can adjust, e.g. we can start off of .Window property as you suggested, it's not a blocker.

@sdanyliv

The issue here is that it doesn’t guarantee a correct window function definition.

This is a good first design question. Is it an explicit goal to enforce 100% valid syntax?
In general I agree it's a good thing to have, but if it turns out to be very tricky, I think we can shift some blame for writing valid SQL to users and they have to test.

But let's talk about some concrete examples, as it's interesting. You mention PERCENTILE_CONT. This is part of ordered set aggregate functions and their syntax is actually different.
As you mention they take a single order in WITHIN GROUP (ORDER BY ..), not inside OVER (..).

In fact in Postgre, these functions DO NOT support windows! It seems to be the simple approach for these would be a dedicated two args function:

// T PercentileCont<T>(double fraction, T withinGroupOrderBy)
Sql.PercentileCont(0.34, table.Salary)

Things get more fun if you look at other DB such as MSSQL. PERCENTILE_CONT is very similar to Postgre (forced usage of WITHIN GROUP) _but MSSQL accepts a window, containing only a PARTITION BY:

-- MSSQL syntax
PERCENTILE_CONT ( numeric_literal )
    WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] )
    OVER ( [ <partition_by_clause> ] )

Per-provider syntax is not enforceable in our API. For the MSSQL extension, I personally would suggest re-using the same PercentileCont signature as above at the end of a window:

Window.PartitionBy(table.Department).PercentileCont(0.34, table.Salary)

Now this is not perfect:

  • It would fail on RDBMS where this is not supported (e.g. Postgre). I don't think there's anything we can do about it.
  • It would fail on MS SQL if you add invalid window clauses, such as referring to a named window (this would always fail anyway in MSSQL < 2022) or if you add ORDER BY. Quite frankly, I would be ok with this: you get the query you asked for, and it didn't make sense.

I would need to lookup how the standard describes these functions. Maybe the PARTITION BY clause is not even a proper "window" per se, in which case I would suggest an overload with additional parameters rather than using Window here.

@sdanyliv can you think of other significant cases where using the window first would allow invalid SQL, that are significant enough that we shouldn't allow them?

Regarding window definition, I might be mistaken, but I don’t see much benefit in defining a window the way you suggested.

Are you referring to the name usage only, or something else?
This was an idea to provide an optional name in generated SQL, I think we have a similar feature for CTE?

// WINDOW w as (PARTITION BY t.Department)
let w = Window.PartitionBy(t.Department)

// WINDOW dept as (PARTITION BY t.Department)
let w = Window.WithName("dept").PartitionBy(t.Department)

I don't feel strongly about this, we can always add later or not at all. As I said, I was mostly inspired by similar CTE option.

@jods4
Copy link
Contributor

jods4 commented Mar 4, 2025

I did a quick search on PERCENTILE_CONT. Postgre is the outlier here.
MSSQL, Redshift, Snowflake, Oracle, MariaDB all define this function as both an aggregate and a window function -- and they all use the exact same syntax: when it's used as a window function only PARTITION BY seems supported by RDBMS.

If we really wanted to enforce this at syntax level (personally not convinced) we could do it with "clever" extensions methods.
If we Window is a IUnorderedWindow : IWindow and becomes IOrderedWindow : IWindow after .OrderBy(), then PercentileCont can be an extension method on IUnorderedWindow only.

Likewise frames (RANGE | ROWS | GROUP) only make sense on ordered window so could be extensions of IOrderedWindow.

It should be noted that if we want to go to the very end of this, RANGE is only valid when there's a single ORDER BY expression, but this is something that can be done as well. In fact, because of how ThenBy can only be proposed after an OrderBy we already need some differentiation in static typing based on number of columns ordered by. And if we want to type the range strictly, the type of that first expression must be reflected in that static type.

@sdanyliv
Copy link
Member Author

sdanyliv commented Mar 4, 2025

@jods4, I'm open to your API proposal—mainly consisting of extension methods and interfaces—which would require minimal rework on my end.

  1. Window functions should reuse a predefined window.
  2. They should return the correct type.
  3. The compiler should prevent defining an invalid window unless a predefined one is used.

If you can implement this for several complex window functions, I'll review and compare. This part is quite tricky...

Leave PercentileCont behind, It is realy strange function and you are wrong with syntax, it is aggregation function with OVER PART and actually it perfectly fits to proposed API. Check sample in first PR comment.

Why I proposed the new syntax:

  1. Easier to translate – I don’t have to search for all possible PartitionBy, OrderBy, or ToValue calls and then trace back to the window function. While it extends translation significantly, it becomes feasible with comprehensive tests.
  2. Leverages standard translation – I used the standard IMemberTranslator, making window function translation straightforward.
  3. Reliable window function closure – The chances of missing the function terminator (ToValue()) are nearly eliminated. Forgetting it isn’t critical but still important.
  4. Compiler enforcement – The compiler will prevent compilation if a function is missing a required partition, ordering, or other necessary components.
  5. Simpler and more structured – Compare these approaches:
    • RN1 = Sql.Window.RowNumber(f => f.PartitionBy(t.Field1).OrderBy(t.Field2)),
    • RN1 = Sql.Ext.RowNumber().Over().PartitionBy(t.Field1).OrderBy(t.Field2).ToValue(),

@jods4
Copy link
Contributor

jods4 commented Mar 4, 2025

@sdanyliv I like this initiative and I agree with you the current syntax is clumsy.
I especially don't like the ToValue() call either, which is why I had the idea of putting the function at the end.

I think my proposal is not perfect either, I'm just trying to put out as many ideas as possible so we can pick the best one before publishing them.

Leave PercentileCont behind, It is realy strange function and you are wrong with syntax,

What do you mean I'm wrong with syntax? I have check 6 different DB docs, the following was quoted straight from MS documentation. With exception of Postgre, which doesn't support OVER at all, all RDBMS uses WITHIN GROUP instead of OVER for the ORDER BY:

PERCENTILE_CONT ( numeric_literal )
    WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] )
    OVER ( [ <partition_by_clause> ] )

Compared approaches from point 5:

// Current linq2db, I think this is worse option
RN1 = Sql.Ext.RowNumber().Over().PartitionBy(t.Field1).OrderBy(t.Field2).ToValue(),
// Current PR, using lambda inside analytical function
RN1 = Sql.Window.RowNumber(f => f.PartitionBy(t.Field1).OrderBy(t.Field2)),
// What I had proposed in comment above:
RN1 = Sql.Window.PartitionBy(t.Field1).OrderBy(t.Field2).RowNumber()

On this example, the PR and my proposal are very similar, I personally like the 3rd option better that 2nd because there's no nesting or lambda.

Can you tell me what you think are "hard" examples so that I can try to see if they could fit in a different proposal?

For ordered set aggregates mode, percentile_cont, percentile_disc, I would handle them as above: with 2 mandatory arguments (fraction + order expression). If the provider supports using them as windowed functions with an OVER (partition) I would chain them after a window.

Note that some DB (e.g. Postgre and Oracle) also have very weird Hypothetical-Set Aggregates like rank(args) within group (order by sorted_args). Althought these are never window functions, their treatment of order by would be off and inconsistent with what I proposed for percentile_cont (because they have multiple values and should support nulls first etc. 😞

@sdanyliv
Copy link
Member Author

sdanyliv commented Mar 4, 2025

RN1 = Sql.Window.PartitionBy(t.Field1).OrderBy(t.Field2).RowNumber()

This one is bad idea ever :)

When you want RowNumber, you type RowNumber then think about window.

Fail example, with filter:
Sql.Window.Filter(t.Some < 10).PartitionBy(t.Field1).OrderBy(t.Field2).... oops RowNumber is not possible because of Filter.

@igor-tkachev
Copy link
Member

igor-tkachev commented Mar 14, 2025

I suggest adding another option:

Sql.Window.Count().Distinct(x.Some).Over(f => f.PartitionBy(x.Some))
Sql.Window.Count().Over(f => f.PartitionBy(x.Some))

This would be more consistent with:

Sql.Window.Sum().Distinct(x.Some).Over(f => f.PartitionBy(x.Some))
Sql.Window.Sum().Over(f => f.PartitionBy(x.Some))

and other similar patterns.

@jods4
Copy link
Contributor

jods4 commented Mar 14, 2025

@igor-tkachev
How would you address the following two points with your proposal?

(1) How is your expression typed as its value?
Does .Over() return a type appropriate for the target function?
If so, does it mean that the exact same syntax cannot be used when those functions are used as aggregates (no OVER clause)?

(2) Can you provide an example for more analytical functions, do you think the syntax scales nicely?
For instance, how would
string_agg(DISTINCT name, ',' ORDER BY name NULLS LAST) FILTER (age > 18)
look in C#, in a way that's somewhat re-usable / consistent across window and aggregate functions?

@jods4
Copy link
Contributor

jods4 commented Mar 14, 2025

@sdanyliv With the syntax currently proposed in this PR, what's your take on differentiating ORDER BY from inside OVER vs WITHIN GROUP vs inside function? It seems to me you want to use an .OrderBy() call for all these situations but they are actually slightly different.

This is a bit hypothetical at the moment because I don't know any DB engine that supports it, but the following does make sense:

string_agg(name, ',' ORDER BY age) OVER (
  PARTITION BY dept 
  ORDER BY code 
  GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)

This extremely convoluted but it has sense:
GROUPS BETWEEN modifies what rows belong to the window, based on ORDER BY code result.
Then ORDER BY age says how these rows must be sorted when they are concatenated by string_agg.

If you try that with Postgre it doesn't say this is invalid. It says: "aggregate ORDER BY is not implemented for window functions" (so you can't even use string_agg with OVER(PARTITION BY)).
I don't know of any DB that supports this today, but it could be supported in the future (MSSQL already supports STRING_AGG OVER() but only with PARTITION BY, no ORDER).

A similar logic/argument exists for functions that use WITHIN GROUP ORDER BY.

So I think it'd be nice to reflect this structure in our API, although I also don't want it to become overly verbose.

Proposal: should we add .Over(), always at the end, and precede the window definition (so it's clear we're not WITHIN GROUP)?

Count(x => x.Over(declaredWindow));
Sum(cost, x => x.Over.PartitionBy(color).OrderBy(date));
StringAgg(name, ", ", x => x.OrderBy(age).Over.OrderBy(dept).Groups(1, 1))

@sdanyliv
Copy link
Member Author

sdanyliv commented Mar 15, 2025

@igor-tkachev

I suggest adding another option:

Sql.Window.Count().Distinct(x.Some).Over(f => f.PartitionBy(x.Some))
Sql.Window.Count().Over(f => f.PartitionBy(x.Some))

That's why we rewrite window functions, they have almost similar syntax:
Let's brake this down from compiler perspective

var query = 
    from t in db.Some
    select new 
    {
         a = Sql.Window.Count(), // invalid but compilable
         b = Sql.Window.Count().Distinct(x.Some), // invalid but compilable
         valid = Sql.Window.Count().Distinct(x.Some) .Over(f => f.PartitionBy(x.Some)) 
    }

Pros:

  1. Simple for reading and maybe typing

Const:

  1. Function can be divided to invalid parts
  2. Count of Interfaces, which should be introduced grows significantly and it is simplest function.
  3. Parsing is complicated, We should start translation from Over. then collect everything up to Sql.Window.Count()
  4. Well... many hidden problems which users may introduce..

@sdanyliv
Copy link
Member Author

@jods4, good point! I'm starting to think that the Sql.Window prefix might be too weak. What do you think about Sql.Functions instead?

I also changed .Over() to a method instead of a property for consistency:

Sql.Functions.Count(x => x.Over(declaredWindow));  
Sql.Functions.Sum(cost, x => x.Over().PartitionBy(color).OrderBy(date));

Regarding aggregates, you missed how they should be defined—as an extension over IEnumerable:

var query =  
    from t in db.Some  
    group t by t.SomeId into g  
    select new  
    {  
        g.StringAgg(e => e.name, ", ", (e, x) => x.OrderBy(e.age).Over().OrderBy(e.dept).Groups(1, 1))  
    };

There are still open questions about CountDistinct and frame definition. .Groups(1, 1) is the simplest approach.

@jods4
Copy link
Contributor

jods4 commented Mar 16, 2025

@sdanyliv maybe I'm crazy but I'd like to circle back to my initial idea (put window/over first, like groupby) with a mix of your options builder.

I'm going to put the window upfront (everything inside OVER()).
Everything that is either inside the function (such as ALL|DISTINCT in aggregates and ORDER BY in string_agg) or before OVER (i.e. WITHIN GROUP and FILTER), I'm going to put inside the function builder, too.

// First we'll have a Window class that extends IEnumerable
sealed class Window<T> extends IEnumerable<T>;

// A window can be defined with a builder. Here's a WINDOW clause:
from p in People
let wnd = Sql.Window(w => w.PartitionBy(p.Gender).OrderBy(p.Age))

// Aggregates are extensions of IEnumerable<T>
// They are configured with builder, too
from g in db.People.GroupBy(p => p.Age)
select new 
{
  Count = g.Count(), // basic usage: COUNT(*)
  Count = g.Count(e => e.Nationality, (a, _) => a.Distinct()), // COUNT(DISTINCT nationality)
  FilteredMax = g.Max(e => e.Siblings, (a, e) => a.Filter(e.Gender == 'F')) // MAX(siblings) FILTER (WHERE gender = 'F')
}

// Aggregates can be called on IQueryable through `Aggregate[Async]()`  (name up to bikeshed), 
// and for most useful cases we can define an shortcut method on IQueryable directly
db.People.AggregateAsync(e => e.Count(o => o.Distinct()));  // Any aggregate can be immediately executed
db.People.CountAsync(o => o.Distinct()); // Nice-to-have shortcut for count.

// Because Window is IEnumerable, aggregates can naturally be called on windows.
// Although I can't figure how to make Window generic in this context, so it still has to be a different overload :(
// The upside is that less lambdas are required.
from p in db.People
let wnd = Sql.Window() // OVER (): everything in one partition
select wnd.Max(p.Age, a => a.Filter(p.Gender == 'M'))  // MAX(age) FILTER (WHERE gender = 'M') OVER (wnd)

// Windows can also be declared on the spot or extended 
from p in db.People
select new 
{
  oldest = Sql.Window(w => w.PartitionBy(p.Gender)).Max(p.Age),  // MAX(age) OVER (PARTITION BY gender)
  youngest = Sql.Window(wnd, w => w.PartitionBy(p.Gender)).Min(p.Age),  // MIN(age) OVER (wnd PARTITION BY gender)
}

// Windowed functions are defined as extensions of Window, so they can't be used without one:
from p in db.People
select Sql.Window(w => w.PartitionBy(p.Gender).OrderBy(p.Name, Nulls.First))
  .Rank()  // RANK() OVER (PARTITION BY gender ORDER BY name NULLS FIRST)

// Finally, my hypothetical string_agg example that no DB supports today:
Sql.Window(w => w.OrderBy(p.birthday.Month).Groups(1, 1))
  .StringAgg(p.name, ",", o => o.OrderBy(p.name))
  // STRING_AGG(name, ',' ORDER BY name) 
  // OVER (ORDER BY month(birthday) GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

@jods4
Copy link
Contributor

jods4 commented Mar 16, 2025

Okay small tweak on my previous idea that I think I like better:

// Use `Over()` to be closer to actual SQL code, make it fluent
select Sql.Over()
  .PartitionBy(x.Country)
  .OrderBy(x.Date)
  .Sum(x.Amount, o => o.Filter(x.Amount > 0))
  
// Declared windows can be fluent too (less lambdas) and use Window() to remain closer to SQL
let wnd = Sql.Window().PartitionBy(x.Color)  // WINDOW wnd AS (PARTITION BY color)
// Basic usage doesn't really require `Over()` but I think I like it.
// The main issue _without_ `Over()` is that it allows calling `wnd` fluent builder and that's ambiguous here,
// esp. considering the next example.
select Sql.Over(wnd).Max(x.Price) // MAX(price) OVER (wnd) 

// Windows can be modified when they're used:
select Sql.Over(wnd)
  .PartitionBy(x.Year)
  .Max(x.Price)  // MAX(price) OVER (wnd PARTITION BY year)
  
// Window declarations can be based off previous windows
let wnd2 = Sql.Window(wnd).OrderBy(x.Age) // WINDOW wnd2 AS (wnd ORDER BY age)

@sdanyliv
Copy link
Member Author

@jods4,

Summary:

Comparison of Syntax:

Sql.Over()
  .PartitionBy(x.Country)
  .OrderBy(x.Date)
  .Sum(x.Amount)

vs

Sql.Functions.Sum(x.Amount, f => f
  .Over()
  .PartitionBy(x.Country)
  .OrderBy(x.Date))

Sql.Over()
  .PartitionBy(x.Country)
  .OrderBy(x.Date)
  .Sum(x.Amount, o => o.Distinct().Filter(x.Amount > 0))

vs

Sql.Functions.Sum(x.Amount, f => f.Distinct().Filter(x.Amount > 0)
  .Over()
  .PartitionBy(x.Country)
  .OrderBy(x.Date))

Aggregates:

Currently, only I have defined them:

var query = 
    from t in db.Some
    group t by 1
    select g.StringAggregate(x => x.Name, (x, f) => f.OrderBy(x.Name).Over().PartitionBy(x.Department))

Your Syntax:

Advantages:

  1. No need for a dedicated DefineWindow function.
  2. In most cases (90%), window functions do not require a lambda.

Disadvantages:

  1. Translation becomes slightly more complex, though still feasible.
  2. Function name appears at the end, which may affect readability.
  3. Window functions have a distinct syntax compared to other functions.
  4. Part of the window definition is outside the function, while some parts are inside a lambda, requiring extra effort.
  5. Builder interfaces leak into the LINQ query.
  6. Defining a frame boundary results in invalid syntax for ROW_NUMBER.

Proposed Alternative:

Advantages:

  1. Provides a unified syntax for defining both window and non-window functions, ensuring consistency even for aggregates.
  2. Maintains a natural order of function definitions.
  3. Simplifies translation and error handling.
  4. Stronger compiler control over function definitions.
  5. No leakage of builder interfaces.

Disadvantages:

  1. Slightly more verbose due to the use of lambdas and parentheses.
  2. Requires DefineWindow for reusable window definitions, though this could be beneficial if named definitions are needed.
  3. ..... please define

I will define later list of API examples for all window functions and we can choose better syntax for definition.

@jods4
Copy link
Contributor

jods4 commented Mar 17, 2025

@sdanyliv some thoughts on pros/cons:

Re. my proposal I would add advantages:

(3) Aggregate lambda is identical whether used on a groupby or on a window.
E.g.: g.Count(Func<O>) and wnd.Count(Func<O>) -> same type O, no need to add Over() apis in second call.

(4) Concept of window (which set of rows are passed to function) is clearly separated from configuration of function (receiver) itself.

(5) Function name at the end is consistent with groupby, and C#/LINQ in general (which is the opposite of your disadvantage 2, not quite agree on this one).

(6) All windowed functions can be discovered as extensions of Window, incl. provider-specific or user-defined ones.

(7) Formatting tools such as CSharpier will format better:

// Your last example as formatted by Prettier:
// My proposal
Sql.Over()
  .PartitionBy(x.Country)
  .OrderBy(x.Date)
  .Sum(x.Amount, o => o.Distinct().Filter(x.Amount > 0))

// This PR
Sql.Functions.Sum(
  x.Amount, 
  f => 
    f.Distinct()
      .Filter(x.Amount > 0)
      .Over()
      .PartitionBy(x.Country)
      .OrderBy(x.Date)
)

Re. disadvantages of my proposal:

  1. Function name appears at the end, which may affect readability.

Debatable, I don't quite agree: see advantage 5 above (consistency with groupby / LINQ / C# in general).

  1. Window functions have a distinct syntax compared to other functions.

Can you clarify what you mean? I don't get it.

  1. Part of the window definition is outside the function, while some parts are inside a lambda, requiring extra effort.

I don't get this one either. The SQL window is defined in OVER () and that part of the SQL spec is fully outside the function. What part of the window is "partly inside" and "requires extra effort"?

  1. Defining a frame boundary results in invalid syntax for ROW_NUMBER.

I am not sure why having a frame into a window makes ROW_NUMBER invalid? PostgreSQL 17 runs the following just fine:
select row_number() over (order by 1 rows 1 preceding)

Re. your proposal advantages:

  1. Provides a unified syntax for defining both window and non-window functions, ensuring consistency even for aggregates.

I don't understand how having different options in the same aggregate function, depending on whether it's called on a group or a window, is "unified". In fact, I mention the exact opposite of this as advantage 3. above.

  1. Stronger compiler control over function definitions.

Do you have a concrete example? I am not aware of any aggregate-like functions that imposes requirements on its window.
(Conceptually / based on SQL standard: I have mentioned in this thread that I know no RDBMS that supports framing combined with ordered windowed functions -- e.g. string_agg -- but that's in theory possible and could be implemented one day.)
See my test of ROW_NUMBER above; also discussion of LEAD below.

Re. your disadvantages

  1. Requires DefineWindow for reusable window definitions, though this could be beneficial if named definitions are needed.

If we want an explicitly named function with my syntax, it's as easy as let w = Window("name"), no real difference here.

  1. (new) Windowed functions are harder to find when typing code, esp. if not built-in.
    Are they on Sql, Sql.Window, Sql.Functions, and where do I find provider-specific or user-defined ones?
    This problem is not specific to windowed functions, but it can be avoided if they are all extensions of Window.

Re. your comment below on LAG and LEAD

I don't get the issue?

Define range in your proposed syntax

It would be chained after OrderBy, but unavailable after ThenBy (range only works with 1 expression).
I think this works identically in your proposal?
Exact method syntax and semantics to be discussed, but maybe:

// OVER (ORDER BY age RANGE BETWEEN 10 PRECEDING AND 5 FOLLOWING)
Over().OrderBy(x.Age).Range(10, 5);

For example LAG, LEAD do not allow range definition.

What do you mean? I tested the following in PostgreSQL 17 and it works:

-- no order? not useful but works
select lead(n) over () 
from Test

-- range?
select lead(n) over (order by n range 1 preceding)
from Test

@sdanyliv
Copy link
Member Author

@jods4, sorry, but is hard to follow your answer, add citations to numbers. Show more examples (full examples) than words. Define range in your proposed syntax. For example LAG, LEAD do not allow range definition.

@jods4
Copy link
Contributor

jods4 commented Mar 17, 2025

@sdanyliv no worry, there's a lot to process! I have edited my post above, hopefully it's easier to read now.

@sdanyliv
Copy link
Member Author

Postgres 17 introduces new functionality. Oracle and other database providers currently don't support frame definitions in these window functions.

The syntax you've provided for the Frame is the simplest case, defining unbounded ranges clearly:

OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) EXCLUDE CURRENT ROW

Consider also this example below, which is syntactically valid from a compiler perspective but difficult to understand clearly:

var query = 
    from t in db.Some
    from ss in (
        from ss in db.Some2
        let o = Sql.Over()
        let partition = o.PartitionBy(ss.Some2Field)
        let ordering = partition.OrderBy(t.SomeField)
        select ordering
    )
    select new
    {
        sum = ss.RowsRange
                .Between.CurrentRow.And.Unbounded
                .Exclude.CurrentRow
                .Sum(x => x.SomeValue)
    };

Just want to mention that I don't like that we cannot control this process. Something may work, but something will fail - additional code, additional tests.

@jods4
Copy link
Contributor

jods4 commented Mar 17, 2025

Postgres 17 introduces new functionality. Oracle and other database providers currently don't support frame definitions in these window functions.

Yes. This is why I feel we shouldn't go to crazy lengths to model specific RDBMS support in C# apis. Let's reflect the SQL standard.
SqlBuilder can throw exception if the provider doesn't support a feature -- or we can make a straight translation and the query will fail because of unsupported syntax. Developers can't be oblivious of their DB target.

The syntax you've provided for the Frame is the simplest case, defining unbounded ranges clearly:

I guess this will be the same regardless of which design we pick?
The shortest call could be to take nullable int, although it's not very self-descriptive:

  • null = UNBOUNDED [PRECEDING|FOLLOWING] (only one makes sense, frame end is forbidden to be UNBOUNDED PRECEDING, likewise frame start can't be UNBOUNDED FOLLOWING)
  • n < 0 = n PRECEDING
  • 0 = CURRENT
  • n > 0 = n FOLLOWING

Exclusion could be an enum overload, similar to how we handle NULLS FIRST|LAST.

So for your example:

// OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) EXCLUDE CURRENT ROW
Sql.Over()
  .OrderBy(t)
  .Rows(null, 0, Exclude.CurrentRow)

Consider also this example below, which is syntactically valid from a compiler perspective but difficult to understand clearly:
[.Over() misuse in subqueries]

Okay that's crazy, I didn't even consider it.
Can the compiler easily limit Over() usage?
When calling .Over() you start a window expression, which must end with an aggregate-like function, or it is a compilation error.

  • Sql.Window(w => ...) returns Window. Only acceptable uses of this type are: being passed to Sql.Over(wnd) or Sql.Window(wnd) for extension, and being passed to an aggregate-like function (-> results in OVER(..) clause).
  • Sql.Over() starts a window expression that MUST end with an aggregate-like function, i.e. the final expression value cannot be IOver (or similar).

@sdanyliv
Copy link
Member Author

sdanyliv commented Mar 17, 2025

Sql.Over() starts a window expression that MUST end with an aggregate-like function, i.e. the final expression value cannot be IOver (or similar).

I'm not sure about the compiler feature you're referring to, but the proposed implementation won't compile.

Anyway, I like idea about frame range (writing in prosed syntax)

Sql.Functions.Sum(x. Some, f => f.Distinct().Over()
    .PartitionBy(x.Some)
    .OrderBy(x.Some)
    .Rows(null, 0)
//    .Range(null, 0)
//    .Groups(null, 0)
    .ExcludeCurrentRow()
//    .ExcludeNoOthers()
//    .ExcludeGroup()
//    .ExcludeTies()
)

Looking at syntax and https://www.sqlite.org/windowfunctions.html - do we handle all cases?

@jods4
Copy link
Contributor

jods4 commented Mar 17, 2025

I think that's what I meant: linq2db builder should refuse to compile invalid Over() expressions, such as the example you provide with let and subqueries.

@sdanyliv
Copy link
Member Author

@jods4, check updated comment.

I understand your point, but I'm still hesitant - I'm strongly inclined to prevent compiling bad functions. Let me first create a table of the functions we plan to implement, and then we'll discuss further.

@jods4
Copy link
Contributor

jods4 commented Mar 17, 2025

Looking at syntax and https://www.sqlite.org/windowfunctions.html - do we handle all cases?

We need an overload that takes only the frame start .Rows(1) and then we do, see also my summary of PostgreSQL specs above.

For ORDER BY, PostgreSQL supports custom comparators with USING operator, which I don't think we support, but that's not specific to OVER.

@igor-tkachev
Copy link
Member

In SQL, the OVER clause is the key element that defines a Window Function, as specified in the official SQL documentation:
Microsoft SQL Server - OVER Clause.

Based on this, our API should reflect this fundamental structure. So, our Window Function API should always start with Sql.Window and end with .Over(...).

Everything else - Ranking Functions, Aggregate Functions, Analytic Functions - should be placed between these two constructs. This way, we can clearly separate regular Aggregate Functions from Window Aggregate Functions, avoiding confusion in usage.

According to the specification, the OVER clause consists of:

OVER ( 
       [ <PARTITION BY clause> ]
       [ <ORDER BY clause> ]
       [ <ROW or RANGE clause> ]
      )

This means that an API like:

Over().PartitionBy().OrderBy()

would be problematic because the entire construct must return a column value, and there's no clear way to bind the method chain to anything meaningful. Alternatively, we could make the chain end with something like:

Over().PartitionBy().OrderBy().Value()

but that introduces unnecessary complexity (say hello to Builder pattern), which may not be ideal in this case.

Given these constraints, we have two viable options:

Lambda-based configuration:

Over(x => x.PartitionBy(...).OrderBy(...).RowOrRange(...))

This approach is concise and consistent with other fluent APIs. It avoids ambiguity in interpreting Over(x => x), ensuring clarity in the function's purpose.

Multiple lambda overloads for PartitionBy, OrderBy, etc:

Over(x => x, x => x)

This would require multiple overloads to cover different cases. But this one introduces ambiguity:

Over(x => x) // Is this PartitionBy or OrderBy?

Resolving this ambiguity would either require additional overloads or explicit method naming, both of which complicate API usage.

Given the above considerations, the most reasonable and practical API design would be:

Sql.Window.[SqlFunction()].Over(x => x.[PartitionBy(...)].[OrderBy(...)].[RowOrRange(...)])

@sdanyliv
Copy link
Member Author

@igor-tkachev, let's look at corner cases

// SUM(DISTINCT some) OVER (...)
Sql.Window.Sum(t.Value, f => f.Distinct).Over(f => f.[PartitionBy(...)].[OrderBy(...)].[RowOrRange(...)])

With one lambda for everything

Sql.Window.Sum(t.Value, f => f.Distinct.Over().[PartitionBy(...)].[OrderBy(...)].[RowOrRange(...)])

With window definition

let wnd = Sql.Window.DefineWindow(f => f.[PartitionBy(...)].[OrderBy(...)].[RowOrRange(...)]);
sum = Sql.Window.Sum(t.Value, f => f.Distinct.Over(wnd))

Can you catch these cases?

@igor-tkachev
Copy link
Member

@sdanyliv

A single lambda looks like a workable option but is less intuitive. As mentioned earlier, the OVER clause is the key element in defining Window Functions, so it makes sense to highlight it explicitly in the API.

There are already inconsistencies in your example:

f => f.Distinct.Over()

Here, Distinct has no direct relation to Over, so they should not be dependent on each other in the API. Distinct should be isolated at the level of Sum, while Over should remain a standalone method.

Keeping Over as an independent construct prevents such ambiguities and makes the API more intuitive and aligned with the SQL specification.

@jods4
Copy link
Contributor

jods4 commented Apr 9, 2025

Two funsies I encountered at work today, which should be accounted for in the syntax (those at least work in Oracle).

Postgre doesn't support this, but SQL standard defines an option RESPECT NULLS and IGNORE NULLS for lead, lag, first_value, last_value, and nth_value. This should be after the function but Oracle supports this syntax both inside (proprietary) and outside (standard). Example: last_value(order_id) ignore nulls over (order by order_date).

I believe FIRST and LAST (aka KEEP) are Oracle-specific, see https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/FIRST.html. They let you apply an aggregate function to the set of values that ranks first or last (often a single row, unless you have ties). They are very similar in nature to OVER framing, but:

  1. You can use FIRST and LAST with or without OVER (usable in a normal aggregate / group by as well).
  2. What it achieves actually can't be done with OVER frames (frames are always defined around current row, you can't say "absolute first")

It can be applied to MIN, MAX, SUM, AVG, COUNT, VARIANCE, or STDDEV. (EDIT: that's what Oracle docs say, but in practice I also apply it to other functions such as ANY_VALUE...).
Example: min(id) keep (dense_rank last order by salary)

@jods4
Copy link
Contributor

jods4 commented Apr 9, 2025

IMHO the first [RESPECT | IGNORE] NULLS should a fluent option when configuring LEAD and co.

For the second, I would be tempted to create a .KeepFirstBy[Desc]().ThenBy[Desc]().Min(val) on IEnumerable (for aggregates / group by) and on Over(). Still with the ideas that aggregate function should be last in C#, an example would be:

Sql.Over()
  .PartitionBy(country)
  .KeepLastBy(year, Order.NullsLast).ThenByDesc(population)
  .Min(gdp)
  
// MIN(gdp) KEEP (DENSE_RANK LAST ORDER BY year NULLS LAST, population DESC) OVER (PARTITION BY country)

@sdanyliv
Copy link
Member Author

sdanyliv commented Apr 9, 2025

@jods4, I avoid putting the function name at the end of a window function definition because it can lead to incorrect syntax that the compiler won't catch.

For me, enforcing strict syntax that prevents misuse is more important than readability.

@jods4
Copy link
Contributor

jods4 commented Apr 9, 2025

@sdanyliv didn't we go over this?
I agree that function-specific options should be inside a fluent option builder inside the call, like .Count(id, x => x.Distinct()).
What I would put in front is:

  • IEnumerable for regular GROUP BY aggregates (nobody argued with this one)
  • Over() for analytical windows
  • KeepFirst() / KeepLast() for consistency, as they apply to all aggregates and windows. I actually just noticed this is how it works in linq2db v5 BTW.

@jods4
Copy link
Contributor

jods4 commented Apr 9, 2025

If you want a real-world complex query to compare syntax and results, I coded this today:

  db.PartialStatementFile
    .Merge()
    .Using(
      PendingFiles()
        .Where(x => x.LastPageId == null)
        .Select(x => new
        {
          x.Id,
          LastPageId = Sql.Ext.Max(x.Id)
            .KeepLast().OrderBy(x.IsLastPage)
            .Over().PartitionBy(x.BankSwift, x.AccountNumber, x.StatementId)
            .ToValue(),
          HasLastPage = Sql.Ext.Max(x.IsLastPage)
            .Over().PartitionBy(x.BankSwift, x.AccountNumber, x.StatementId)
            .ToValue(),
          MaxPage = Sql.Ext.Max(x.PageNumber)
            .Over().PartitionBy(x.BankSwift, x.AccountNumber, x.StatementId)
            .ToValue(),
          PageCount = Sql.Ext.Count()
            .Over().PartitionBy(x.BankSwift, x.AccountNumber, x.StatementId)
            .ToValue(),
        })
        .Where(x => x.HasLastPage && x.PageCount >= x.MaxPage)
    )
    .On((stmt, g) => stmt.Id == g.Id)
    .UpdateWhenMatched((stmt, g) => new()
    {
      LastPageId = g.LastPageId
    })
    .Merge();

Resulting query (slightly reformatted):

MERGE INTO PartialStatementFile Target
USING (
  SELECT
    x.Id,
    x.LastPageId
  FROM
    (
      SELECT
        f.Id,
        MAX(f.Id) KEEP (DENSE_RANK LAST ORDER BY f.IsLastPage) OVER(PARTITION BY f.BankSwift, f.AccountNumber, f.StatementId) as LastPageId
        MAX(f.IsLastPage) OVER(PARTITION BY f.BankSwift, f.AccountNumber, f.StatementId) as HasLastPage,
        MAX(f.PageNumber) OVER(PARTITION BY f.BankSwift, f.AccountNumber, f.StatementId) as MaxPage,
        COUNT(*) OVER(PARTITION BY f.BankSwift, f.AccountNumber, f.StatementId) as PageCount,
      FROM
        PartialStatementFile f
      WHERE
        Nvl2(f.MergeFileId, NULL, f.Received) < :before AND
        f.LastPageId IS NULL
  ) x
  WHERE
    x.HasLastPage = 1 AND x.PageCount >= x.MaxPage
) Source
ON (Target.Id = Source.Id)

WHEN MATCHED THEN UPDATE SET
  Target.LastPageId = Source.LastPageId

My suggested syntax for those, based on previous discussions:

LastPageId = Sql.Over(wnd).KeepLastBy(x.IsLastPage).Max(x.Id)

@jods4
Copy link
Contributor

jods4 commented Apr 9, 2025

Feature request based on example above: it would be great in RDBMS that don't support named WINDOW yet, if linq2db could emulate the feature, simply by repeating the OVER(..) definition.

You can see in my complex example above that I wrote multiple times
.Over().PartitionBy(x.BankSwift, x.AccountNumber, x.StatementId)

Even though Oracle doesn't support WINDOW yet, I would love to just do it once and then .Over(wnd), code would be more readable.

@sdanyliv
Copy link
Member Author

Please focus solely on the syntax—don’t try to cover everything at once. We can define anything later; the key is figuring out how.

Here are my concerns:

  • Placing the function at the end introduces some issues—we can't prevent the user from defining an incorrect window function.
  • While chaining looks elegant, it comes with several drawbacks described several times.

@jods4
Copy link
Contributor

jods4 commented Apr 10, 2025

we can't prevent the user from defining an incorrect window function.

What is an example of an incorrect window function?

While chaining looks elegant, it comes with several drawbacks described several times.

I agree on some drawbacks, which is why I now think that not everything must be chained fluently.
For instance, although it's more nesting, I lean towards a builder inside .Over() instead of fluent syntax. For several reasons but mostly (1) to avoid misuse of windows that you've shown before, and (2) to disambiguate what belongs to OVER and what another options (I'm getting a feel there are many...).
Sql.Over(w => w.PartitionBy(x.Color).OrderBy(x.Size)).Max(x.Cost)

# Conflicts:
#	Source/LinqToDB/Linq/Builder/SequenceHelper.cs
#	Source/LinqToDB/Linq/Translation/ProviderMemberTranslatorDefault.cs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants