The Wayback Machine - https://web.archive.org/web/20220525223804/https://github.com/JSQLParser/JSqlParser/pull/1382

Enhanced Keywords #1382

Open · wants to merge 44 commits into master

Conversation

@manticore-projects (Contributor) commented Oct 18, 2021

Rewrite the Reserved Keywords Logic:

  1. Automatically derive all Keyword Tokens from the Parser.
  2. Explicitly add Reserved Keywords to the Grammar and classify why/when each is reserved.
  3. Auto-generate the list of Allowed Keywords as the difference of 1) and 2) above.
  4. Provide a Gradle task to semi-automatically generate RelObjectNameWithoutValue() based on 3).
  5. Add parametrized Tests for Keywords, testing all keywords.
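The difference step 3) can be sketched in a few lines of Java; the class and token names here are illustrative, not the PR's actual code:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of step 3: the Allowed Keywords are whatever keyword
// tokens the grammar defines, minus the explicitly Reserved ones.
public class AllowedKeywordsSketch {
    public static Set<String> allowedKeywords(Set<String> allKeywords, Set<String> reservedKeywords) {
        Set<String> allowed = new TreeSet<>(allKeywords); // sorted, for stable code generation
        allowed.removeAll(reservedKeywords);
        return allowed;
    }
}
```

With such a set difference, a newly added token stays allowed automatically unless someone deliberately lists it as reserved.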

Advantage:
a) we now have clear documentation of which Keywords are reserved and for what reason, with proper tests
b) allowed keywords are generated (semi-)automatically, especially when more Tokens are added to the Grammar. New Tokens won't break SQL Statements that parsed fine before.

To-Dos:
c) Composite Tokens do not work well and still need to be added manually to ALL_KEYWORDS (we would need to refactor the GRAMMAR in order to avoid such composite tokens)
d) @wumpz: Clarify the meaning/purpose of the various RelObjectNamexxxx() Productions, so we can auto-generate them too.
e) Use the Gradle task to fully inject the RelObjectNamexxxx() Productions (instead of manually updating the code)

Resolves one more Special Oracle Test.
Fixes #1148
Fixes #1450
Fixes #1443
Fixes #1462
Fixes #1508
Fixes #1538

Add Keywords and document which keywords are allowed for what purpose
@wumpz (Member) left a comment

Again, are all those keyword additions to RelObjectNames needed? You talked about all those edge-case improvements. Are most of those additions not exactly that?

src/main/jjtree/net/sf/jsqlparser/parser/JSqlParserCC.jjt (Outdated)
@@ -1733,7 +2007,7 @@ Table TableWithAlias():
Alias alias = null;
}
{
table=Table() [alias=Alias() { table.setAlias(alias); }]
table=Table() [ LOOKAHEAD(2) alias=Alias() { table.setAlias(alias); }]
@wumpz (Member) Oct 20, 2021

It should be the aim to do fewer LOOKAHEADs, not more. All this for some keywords?

@manticore-projects (Contributor, Author) Oct 21, 2021

Yes, all this for some keywords (although the Alias() Production is affected mostly).
But why exactly would you deny Tokens which are not defined as keywords in the holy SQL:2016 standard, when there is no compelling technical reason?

Complaints about keywords feel like the second biggest concern of the users, right after quoting/escaping text and objects.

Derive All Keywords from Grammar directly
Generate production for Object Names (semi-) automatically
Add parametrized Keyword Tests
@manticore-projects (Contributor, Author) commented Nov 1, 2021

@wumpz: Please check out the great work on PR #1254. It adds a new token QUICK, which of course should not become a Reserved Keyword.

Of course the author @OlivierCavadenti has no way of knowing that he would need to add QUICK to the productions RelObjectNamexxxx() in order to enable this token as an Object Identifier. (And how should he? I figured that out only six months after my first contribution.)

This PR would solve this challenge reliably, preventing new tokens from breaking valid statements and easing the entry for new contributors.

@manticore-projects requested a review from wumpz Nov 28, 2021
@@ -227,6 +242,13 @@ task renderRR() {
}
}
}

task updateKeywords(type: JavaExec) {
@wumpz (Member) Nov 28, 2021

Since I do not use gradle, what does this do?

Maven is the main build engine. You changed some gradle build options. Does the maven build still run?

@manticore-projects (Contributor, Author) Nov 29, 2021

It generates and prints the source-code text of the method RelObjectName(), which you would then manually insert/replace in the JSQL Grammar file.

This step is optional and does not affect the Maven build. It is executed manually and on demand only.

The long-term goal, though, is to have a mechanism inside the build tool (Gradle and/or Maven) which, during the build:

  1. analyses the Keywords automatically
  2. modifies the Grammar file (methods RelObjectName() and friends)
  3. builds the Parser with the modified Grammar
  4. runs the Keyword Tests
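Conceptually, the generated text is just the allowed tokens joined into one JavaCC choice. This sketch shows the idea; the production layout and token names are assumptions for illustration, not taken from the PR:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hedged sketch of what the Gradle task conceptually prints: the source text
// of a production accepting any allowed keyword token as an object name.
public class ProductionGenerator {
    public static String relObjectNameProduction(List<String> allowedTokens) {
        String alternatives = allowedTokens.stream()
                .map(token -> "tk=<" + token + ">")
                .collect(Collectors.joining(" | "));
        return "String RelObjectNameWithoutValue() :\n"
                + "{ Token tk = null; }\n"
                + "{\n"
                + "    ( " + alternatives + " )\n"
                + "    { return tk.image; }\n"
                + "}\n";
    }
}
```

The printed text would then be pasted over the existing production in the grammar file, which is exactly the manual step the long-term goal above wants to automate.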

@wumpz (Member) commented Nov 28, 2021

The naming of these RelObjectName... methods is surely an issue that should be addressed in further improvements. So sorry for that: it's grown... ;)

@wumpz (Member) commented Nov 28, 2021

I am not a fan of injecting those keywords, since not every keyword can be introduced this way; some need other parts of the grammar to be modified as well. However, building this keyword testing using JUnit 5 is cool.

@manticore-projects (Contributor, Author) commented Nov 29, 2021

The naming of these RelObjectName... methods is surely an issue that should be addressed in further improvements. So sorry for that: it's grown... ;)

You would do me a great favor by documenting the purpose of those 5 methods verbosely. I tried my best to figure it out by trial and error, but I am still not exactly sure why we ended up with 5 methods. (And no worries about grown code.)

@manticore-projects (Contributor, Author) commented Nov 29, 2021

I am not a fan of injecting those keywords, since not every keyword can be introduced this way; some need other parts of the grammar to be modified as well. However, building this keyword testing using JUnit 5 is cool.

Honestly, I am not a big fan of that PR either -- but after spending many days on this matter trying various things, it was the best approach I found, and in my opinion it is still much better than the current state:

  1. At the moment, Keywords are just a mess and we get a lot of issues about keywords alone.
    Are you able to document right now which keywords are reserved and for what reason? I do not think so, and the PR would solve this documentation problem.

  2. Today, even when you have an opinion on keywords, we have no parametrized tests.
    You kindly admit that these tests are kind of cool, and I appreciate your feedback. Those parametrized tests, however, depend on a well-defined list of keywords, which the PR provides. (Although we can of course argue how/where exactly such a list should be defined.)

not every keyword could be introduced this way and needs other parts of the grammar to be modified

  1. While that is true for the PR, it is already true today. Also, we could easily enforce a policy of "Keywords are defined by simple Tokens" (not allowing complex Tokens for Keywords).

The workflow is to a) define the Tokens and Productions in the Grammar first, and then b) document any new RESERVED Keywords only in the List (emphasis on RESERVED).

The PR has one big advantage: by defining RESERVED keywords only, it will determine any allowed Keywords automatically and modify the Grammar semi-automatically.

@manticore-projects (Contributor, Author) commented Nov 29, 2021

Allow me to ask, please:

  1. What would be your preferred way to manage and document RESERVED KEYWORDs?

  2. What would be your preferred way to maintain any allowed Token/Keyword in RelObjectName()?

  3. What would be your preferred way to test all the possible Keywords (automatically)?

And can we have a Skype chat about this? :-)

@wumpz (Member) commented Apr 9, 2022

I took some days to think about this again. The goal is to automate all allowed-keywords tests, right? Maybe the documentation too.

So this externalization of keywords in getReservedKeywords is a helper method to build parameterized tests for all this stuff. But this results in this weird workflow of regenerating the grammar.

Wouldn't it be better to parse the grammar file and simply extract the list that getReservedKeywords provides? After that, all automated tests could be done like you propose and no regenerating of the grammar is needed.

By the way, renaming all RelObjectNames methods is a must.

@manticore-projects (Contributor, Author) commented Apr 9, 2022

I took some days to think about this again. The goal is to automate all allowed-keywords tests, right? Maybe the documentation too.

Yes, and also: to automatically allow all Tokens that are not reserved keywords (especially after adding new tokens).

So this externalization of keywords in getReservedKeywords is a helper method to build parameterized tests for all this stuff. But this results in this weird workflow of regenerating the grammar.

No. getReservedKeywords is needed to know which tokens to allow in RelObjectName() and friends.

Wouldn't it be better to parse the grammar file and simply extract the list that getReservedKeywords provides? After that, all automated tests could be done like you propose and no regenerating of the grammar is needed.

I think you still miss the main point: Tests and Documentation are nice side products.
The main goal, though, is to allow the permitted Tokens (as the difference between All Tokens and the Reserved Keywords) whenever new Tokens are added.

By the way, renaming all RelObjectNames methods is a must.

Fully agreed.

Can we have a call on this topic maybe? I feel we could clarify the approach within a few minutes if we just talk about it.

@manticore-projects (Contributor, Author) commented Apr 9, 2022

So this externalization of keywords in getReservedKeywords is a helper method to build parameterized tests for all this stuff. But this results in this weird workflow of regenerating the grammar.

Actually, not much has changed: in 99% of all cases you just build as before. Only when a new Token has been defined would you run gradle buildKeywords first, and even then it is optional (at the cost of Allowed Tokens and Documentation).

@wumpz (Member) commented Apr 9, 2022

You are right, but if we read those keywords directly from the grammar file, we never need to run this buildKeywords, because all tests always use the actual list. As always, the documentation should be much more improved, as well as the source-code documentation (if I may make this distinction).

# Conflicts:
#	build.gradle
#	src/main/jjtree/net/sf/jsqlparser/parser/JSqlParserCC.jjt
#	src/test/java/net/sf/jsqlparser/statement/select/SpecialOracleTest.java
# Conflicts:
#	build.gradle
#	ruleset.xml
- read Tokens per REGEX Matcher
- move Reserved Keywords from Grammar into ParserKeywordsUtils
- adjust the Tests
@manticore-projects (Contributor, Author) commented Apr 23, 2022

Greetings.

I have modified the approach and now read the Tokens directly from the Grammar file without invoking JJTree.
Thus all the Keywords have been moved out of the Grammar into ParserKeywordsUtils and we spare the redundant run.

It also handles the Composite Tokens.

Cheers.

@wumpz (Member) commented Apr 24, 2022

I did not look into your modifications, but I think I managed to parse the grammar using JavaCC's classes. (javacc/javacc#221)

@wumpz (Member) commented Apr 24, 2022

So if I understand your modifications right, you are scanning for all tokens defined in the token part. These different RelObjectName productions are used for different purposes. So depending on which one a token is used in, the tests should be different, right?

@manticore-projects (Contributor, Author) commented Apr 25, 2022

I did not look into your modifications, but I think I managed to parse the grammar using JavaCC's classes. (javacc/javacc#221)

Yes, I saw that. Thank you for your effort.
I am not sure if this is better or worse than parsing the Grammar per Regex, because again you engage the JavaCC Parser (now only from inside Java instead of from the Gradle/Maven plugin before).

At the same time I really do not care much, as long as we get the content of all tokens reliably (which seems to work both ways).

So if I understand your modifications right, you are scanning for all tokens defined in the token part.

Yes, but per REGEX (because I do not want to rely on the K_ or S_ prefixes and also want to get the keywords of the compound tokens).
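A minimal sketch of such a prefix-independent scan (the pattern here is illustrative and only covers simple string tokens, not the PR's full matcher):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: extract keyword images from a JavaCC TOKEN section by
// matching the token definitions themselves, so no K_/S_ naming convention
// is required. Simplified; composite/regex tokens would need more work.
public class TokenScanner {
    // matches simple definitions like  <K_SELECT: "SELECT">
    private static final Pattern TOKEN_PATTERN =
            Pattern.compile("<\\s*(\\w+)\\s*:\\s*\"([^\"]+)\"\\s*>");

    public static List<String> keywordImages(String grammarText) {
        List<String> keywords = new ArrayList<>();
        Matcher matcher = TOKEN_PATTERN.matcher(grammarText);
        while (matcher.find()) {
            keywords.add(matcher.group(2)); // the literal image, e.g. SELECT
        }
        return keywords;
    }
}
```

Because the pattern keys on the definition syntax rather than the token name, a token named QUICK or FOO is picked up the same as K_SELECT.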

These different RelObjectName productions are used for different purposes. So depending on which one a token is used in, the tests should be different, right?

Yes, that's why there are two tests so far.

@wumpz (Member) commented May 11, 2022

So these two tests are enough? So we could merge some of those RelObjectName productions?

How we extract the keywords does not matter that much. What matters to me is the direct use of keywords without rebuilding a list, regenerating the grammar, building the parser, and so on.

However, to read out all keywords correctly one has to rebuild this part of JavaCC's grammar parser, so why not reuse the original and not have trouble with some incomplete regular expressions?

@manticore-projects (Contributor, Author) commented May 12, 2022

So these two tests are enough? So we could merge some of those RelObjectName productions?

I see this PR as a first step to move us in a better direction. Certainly more fine-tuning will need to follow. Although I feel this PR is quite massive already, and I would like to see a Merge first before touching anything else.

This particular development step marries the best of both worlds:

  1. it demonstrates the idea and the benefits
  2. it is still fully compatible with what has been there before and does not break anything

Please understand that maintaining multiple PRs in parallel is quite a nightmare for any contributor, and a lot of effort goes into keeping up with origin/master, depending on your Merge order.

How we extract the keywords does not matter that much. What matters to me is the direct use of keywords without rebuilding a list, regenerating the grammar, building the parser, and so on.

However, to read out all keywords correctly one has to rebuild this part of JavaCC's grammar parser, so why not reuse the original and not have trouble with some incomplete regular expressions?

Fully agreed. I would provide a similar method using JavaCC instead of the RegEx (so we could switch on/off later on demand).
JavaCC would be the default, but we could still use the RegEx whenever we hit a snag.

Add some tests to ensure that all Keywords are found
@manticore-projects (Contributor, Author) commented May 13, 2022

I implemented the getAllKeywords() method based on JavaCC's tree classes and got it working.
There are pros and cons, though:

  1. We really get ALL keywords, even when not declared as a Token -- that's awesome!
  2. The implementation is quite complex and verbose.
  3. We now carry JavaCC as a dependency, although we need it only for the compilation -- I do not like that at all, but have no good idea what to do about it.

@wumpz (Member) left a comment

I still do not get why you want to hold this keyword list. Since we could extract all keywords, like you mentioned, and read all RelObj... methods, we have all the lists we need to get the reserved keyword list directly from the grammar, without rewriting it.

The dependency on JavaCC is only needed for tests. So this ParserKeywordsUtils list should be a test class only.

Sure, as I already mentioned, the RelObj methods need to be renamed to better reflect their meaning and use.

@@ -37,7 +37,7 @@ public static Stream<String> keyWords() {
List<String> keywords = new ArrayList<>();
try {
try {
keywords.addAll( ParserKeywordsUtils.getAllKeywords(file) );
keywords.addAll( ParserKeywordsUtils.getAllKeywordsUsingRegex(file) );
@wumpz (Member) May 21, 2022

So you use the JavaCC parser just to verify what your regex extracts? Why?

@manticore-projects (Contributor, Author) May 22, 2022

It took me a bit of effort to extract all keyword tokens based on your initial stub. I used the previously Regex-generated List as a reference. When both lists returned the same, I got confident that I had it working.

However, the JavaCC approach was based on trial and error and is not documented. So having a Test in the future might prove useful.

See it as the equivalent of your toString() vs. DeParser -- it is redundant, but provides a great way to test correctness.

@SuppressWarnings({"PMD.ExcessiveMethodLength"})
public static List<String> getReservedKeywords(int restriction) {
// Classification follows http://www.h2database.com/html/advanced.html#keywords
Object[][] ALL_RESERVED_KEYWORDS = {
@wumpz (Member) May 21, 2022

Couldn't those lists be derived directly from the grammar? At least RESTRICTED_ALIAS?

@manticore-projects (Contributor, Author) May 22, 2022

It is the other way around: this List will define how we rewrite RelObjectName.... This List is defined manually, and everything else follows from it.
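A sketch of how such a manually defined table with restriction flags could look; the flag names and keyword rows are invented for illustration, loosely following the H2 classification linked above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a manually maintained reserved-keyword table.
// Each row carries bit flags telling WHERE the keyword is restricted;
// flag names and rows are examples, not the PR's actual contents.
public class ReservedKeywordsSketch {
    public static final int RESTRICTED_FUNCTION = 1;
    public static final int RESTRICTED_ALIAS = 2;
    public static final int RESTRICTED_SQL2016 = 4;

    private static final Object[][] ALL_RESERVED_KEYWORDS = {
            {"SELECT", RESTRICTED_SQL2016 | RESTRICTED_ALIAS},
            {"FROM",   RESTRICTED_SQL2016 | RESTRICTED_ALIAS},
            {"CAST",   RESTRICTED_FUNCTION},
    };

    public static List<String> getReservedKeywords(int restriction) {
        List<String> keywords = new ArrayList<>();
        for (Object[] row : ALL_RESERVED_KEYWORDS) {
            if (((Integer) row[1] & restriction) != 0) {
                keywords.add((String) row[0]);
            }
        }
        return keywords;
    }
}
```

Everything else (the allowed list and the rewritten productions) would then be derived from this one table.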

@manticore-projects (Contributor, Author) commented May 22, 2022

I still do not get why you want to hold this keyword list.

We want to:

  1. Create/Rewrite RelObject... automatically during the build process (at least whenever new tokens/productions have been added to the Grammar)

We will use the difference between ALL keywords and RESERVED keywords for this rewrite:
ALL keywords are parsed from the Grammar, RESERVED keywords come from the manually defined list.

Since we could extract all keywords, like you mentioned, and read all RelObj... methods, we have all the lists we need to get the reserved keyword list directly from the grammar, without rewriting it.

It is the other way around: we get the ALLOWED keywords and rewrite RelObjectNames... accordingly.

Sure, as I already mentioned, the RelObj methods need to be renamed to better reflect their meaning and use.

Certainly. I have asked you above for an explanation of what you had in mind when introducing the 4 methods. (The first 2 are clear to me, but I struggle to understand the methodology behind ...Ext and ...Ext2.)
