GraalJS is around 70X slower than NodeJS 14 #360
Comments
@cip they are aware of these facts and it will get handled soon. Many people are working on it. As a rule of thumb that applies at present: whenever you use node-graal together with native Node modules, you will get these slower results. It is also important that you set the engine to lazy and maybe even put a warmup function into your code. Revisit this in 6 months and you will get much better results; I have written down your name and will ping you when that point is reached :) But I want to give you a current example that proves that graaljs is faster in general: google for es4x and try it. It uses Java + JS and fewer node modules, and it outperforms nodejs by up to 10x in most scenarios of the TechEmpower benchmark.
Thank you @frank-dspeed for your assurance! To be fair, I don't need GraalVM to be that fast, just not that slow. I think that if, after warmup, GraalVM reached let's say 7-8 seconds for this workload a year from now (so 5x slower than NodeJS), that would make it usable for this type of workload: parsing medium/large documents.
@ciplogic I think you understand that a bit wrong. https://www.techempower.com/benchmarks/#section=data-r19&hw=ph&test=composite - when you look here for es4x, which is in fact graaljs, it ranks 9th overall, while nestjs, which is in effect optimized nodejs, ranks 49th. Simply wait a bit and see what happens; I am confident you will be impressed, else I would not waste my time with this. Conclusion: as soon as no nodejs dependency is needed, graaljs is up to 13x faster on some workloads, as it is overall more memory- and CPU-efficient. And we did not even talk about optimization; we talk about the out-of-the-box case. When we account for optimization, for example tuning the JVM inlining parameters, we can end up with much bigger numbers!! The JVM has 100+ options for compiler tuning and supports custom GC implementations. None of this will ever be possible with NodeJS, that is a fact. There are boundaries because of the way NodeJS is designed, and these boundaries do not apply to GraalJS. Hope that sheds some light.
Or to explain it even more simply: with NodeJS, the V8 code is statically AOT-compiled and only the JS gets optimized; with GraalJS, the whole NodeJS and V8 stack gets optimized.
And for your understanding: to get a function warm, it is enough to call it 100 times in quick succession with example data. That takes some milliseconds, not hours or years :)
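A minimal sketch of that warmup idea; `warmup()` and its parameters are hypothetical names for illustration, not an API from GraalJS or Node:

```js
// Call the hot function repeatedly with representative sample data before
// measuring, so the JIT has a chance to compile and optimize it.
function warmup(fn, sampleInput, iterations = 100) {
  for (let i = 0; i < iterations; i++) {
    fn(sampleInput); // results discarded; the point is to trigger compilation
  }
}

// Example: warm a parse function with a small input before timing the real workload.
// warmup(source => acorn.parse(source, { ecmaVersion: 2020 }), smallSample);
```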
@frank-dspeed: ciplogic probably meant 1 year of our engineering time to improve warming up, not 1 year of actually warming up :-) |
Hi @ciplogic, thanks for your request. And thanks for providing a simple-to-run example; that is much appreciated, as it makes my job much easier. Yes, our warmup performance can be very bad, and peak performance might also not be where we want it to be on some examples. Code patterns like yours will help us improve in the future. Warmup performance is our main focus at the moment; we should improve significantly over the next few versions. Looking at your example, however, I cannot reproduce the numbers you report. Yes, we are behind, but not as catastrophically as in your data. Can you please check whether you are using a current version of GraalVM (20.2.0 being the most recent)? Maybe you can also share some basic specs of your machine: how many cores, how much memory, whether you use the CE or EE version of GraalVM, etc. On my machine, I get the following data (warmup generally stabilized after ~5 iterations, which is what I show here; I am using Node 12.x because that is what GraalVM provides as well. I also tried Node 14.x, but the results were comparable): Node.js 12.18.0:
GraalVM EE 20.2.0:
GraalVM CE 20.2.0:
(Using GraalVM EE data unless mentioned otherwise.) So while our first iteration is 37 s compared to 1.9 s, the fourth iteration is 1.695 s vs 11.730 s - that's a factor of 7.3. I totally agree we need to improve both warmup AND peak performance on this example, but I still wonder why your data is so much worse. Noteworthy: CE data is worse for the first ~2 iterations, but at peak it's 11 compared to 13 seconds, so not a huge difference. A surprise contender might be the
While the warmup takes longer than in

Best,
Hi @wirthi, I will do the test with EE; I was using the latest public version of CE. To me a factor of 2.5x is acceptable, even with a long warm-up! If needed, I will try to get some off-the-shelf React-Native application, build it, and attach it so that it reproduces the behavior - or some other application that is quite large. Also, I personally don't care much about warmup; my expectation is that I can wrap this as a web service (using express maybe!?), and a 1-2 minute warm-up would be fine, as I can throw some data at the service at startup. Obviously, if it took 1 hour to warm up, it would be impractical. So let me see when I have time and I will add more data. So if I understand correctly, you need the following:
Hi @ciplogic, yes, an express application is a good example: it will have some slow warmup, i.e. the first few iterations will be significantly slower. The warmup time should be in the seconds to minutes though, until you reach peak (not hours). Regarding your questions:
Best,
So let's start with the file: this is a real project (React-Native) file, copied 4 times to match the input size of the project I was operating with (it is 12 MB vs 11.1 MB). As I understand there is not a large difference between CE and EE, I will not test EE yet (it would take a bit longer and right now I don't have much time to install/download), so I will do it with CE. I had the latest version as far as I can see. My machine specs: Linux x64 Mint (with all updates), AMD 3900X (12c), 48 GB RAM. Output (ran under WebStorm):
Removing
I ran just 3 iterations, but the choice is clear for

Under NodeJS (version 14):
If you have problems reproducing, or you think I should run EE, I am glad to do it.
@ciplogic could you please also try
@frank-dspeed here, it is the fastest:
I updated to the EE version, and the numbers do improve, but they are still quite high. The code is below; it should run in basically the same time, but I wanted to give some hint of where the time goes (as expected, most of it is in parsing):
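The snippet attached to the original comment is not preserved in this copy; the following is a minimal sketch of a harness that splits the time between parsing and traversal, assuming the acorn and acorn-walk packages (file name, iteration count, and the stand-in visitor are illustrative):

```js
const fs = require('fs');
const acorn = require('acorn');
const walk = require('acorn-walk');

const source = fs.readFileSync('main.js', 'utf8'); // the ~12 MB bundle

// Stand-in for the real deduplicateTypes() pass: visit every node once.
function deduplicateTypes(ast) {
  let nodes = 0;
  walk.full(ast, () => { nodes++; });
  return nodes;
}

for (let i = 0; i < 10; i++) {
  const t0 = process.hrtime.bigint();
  const ast = acorn.parse(source, { ecmaVersion: 2020 });
  const t1 = process.hrtime.bigint();
  deduplicateTypes(ast);
  const t2 = process.hrtime.bigint();
  console.log(`iter ${i}: parse ${(t1 - t0) / 1000000n} ms, visit ${(t2 - t1) / 1000000n} ms`);
}
```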
And now all the times are with the EE version. With
JVM mode latency is a wash now with a warmed-up --jvm
Defaults:
And NodeJS for reference:
I also found out it is the parsing. I have an unrelated hint for you: maybe you can and should use the Truffle API to parse the AST :). Sorry, I myself did not finish an acorn-compatible AST implementation on top of Truffle, but I am working on it anyway, as I need it for other stuff like typescript, rollup, webpack, and so on.
I added a side experiment: increasing the GC heap size (not sure if I did it correctly), and it looks like it saves around 0.6 seconds in the warm state (I see around 30.5 seconds vs 31.1 seconds). To me this shows that the slow-down is not in the GC (you probably know that already).
Webpacking does not seem to improve the runtime, but if it would help, I am glad to consider a webpack-like approach. To generate a webpacked file, a 'webpack.config.js' along these lines worked for me:
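The config from the original comment is not preserved here; this is a minimal sketch of a bundling setup of the kind described, with an illustrative entry point and output name:

```js
// webpack.config.js - minimal production bundle targeting node;
// the paths and file names below are assumptions, not from the original comment.
const path = require('path');

module.exports = {
  mode: 'production',
  target: 'node',
  entry: './index.js',
  output: {
    path: path.resolve(__dirname, 'dist'),
    filename: 'main.js',
  },
};
```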
I could not see any relevant speedup from webpack-ing it; maybe the warmup is a few seconds shorter, but I am not sure.
One last spam for today, as I just noticed that there is a

All output is the following:
Just in case the formatting is wrong, I have attached it as a separate txt file:
@frank-dspeed this is an off-topic note on why I opened this issue, and in general on my view of Java/GraalVM/GraalJS. Let's start with Java: I am a C# developer by trade, and I became good friends with Java when I was doing high-frequency trading in C#, as Java was miles ahead (I am talking about the years around 2010-2016), especially on the workloads my company was running at the time. I used so much Java in my side projects that when Intel Research published a scientific paper showing Go as being faster than Java, I could spot even basic mistakes, and I published a journal paper showing how to properly use Java, with Java winning by around one order of magnitude on that workload. Given the (earned!) respect I have for Java, I see why a GraalVM-like approach (meaning Java on Java) should be the way forward, and I really hope that GraalVM will eventually include value types: when you want to remove allocations, recycling objects is tedious because there is no value-type equivalent, and value types would make some kinds of coding possible. I am sure GraalVM will get there. As I was following the GraalVM presentations, one recommendation (from the InfoQ site; I am not related to Oracle or InfoQ) was to test our own workloads and report any bugs we find. This is what I did. This workload is similar to my day-to-day work (I work at a company that does, among other things, JavaScript obfuscation), and we do these operations (parsing code and doing some replacements) using C++. Surprisingly, NodeJS runs faster than C++ on this workload (using the Acorn parser), and the reasons are multiple (e.g. the C++ allocator is typically slower than a GC if you don't use a custom allocator; also we use a different parser, so we may have a less optimal algorithm in places). I wanted to try GraalVM/JS out of curiosity, especially as it was advertised (rightly, I would say) as a drop-in replacement for NodeJS. The original sample was a large customer application that we typically use as a stress test internally, but it is not our largest. As described: if GraalVM were 2.5x slower, it would be on par with our C++ implementation of the JS parser (maybe a bit slower, but not by much), and as a matter of future options, a company like mine could use GraalVM. As I am not involved in business decisions, I cannot say whether this will happen, but I can definitely say that if the runtime is 20x slower (or 70x without --jvm), GraalVM will not happen, at least for this type of workload. So everything described in this bug is simply a stress test that is close to the large use-cases a company that parses JavaScript has as a target. And it tries to be a realistic test case (as it is).
@ciplogic I love your use case, don't get me wrong. I am also not affiliated with Oracle, but I have the same use case as you: I parse a lot of code, as I am writing an AI-driven compiler. My point of view is that if we were, for example, to replace some methods that Acorn uses with Java methods, this would blow away any nodejs implementation. I myself see it not as a drop-in replacement; I see it more as a migration tool to port code partially and incrementally to Java. I am a big fan of JavaScript as glue code. I do the same on NodeJS: whenever I have something expensive to process, I go for Rust and neon bindings to create .node modules, and use them as if they were JavaScript. Hope that information helps you optimize your processes.
I have tried your latest example (#360 (comment)) again on GraalVM 20.3 EE (to be released, but close to our current master or nightlies); I am using octane-mandreel.js (5 MB js source) instead of your

My measurements are around 4.15 seconds for

For me,

-- Christian
@wirthi why is there no transition possible between latency and throughput mode? I mean, can it not do additional optimization? I am wondering what is blocking that. Can we somehow store and preserve the optimizations done - some kind of cache? I know my question may look a bit weird from your point of view, but I simply wonder what stops the compiler from improving to the state that throughput mode gives us. And maybe we could reach the throughput state once and then directly resume from that stage.
@frank-dspeed this is a huge tradeoff game. We have to trade off compilation speed, compilation memory consumption, peak performance, interpreter performance, compiled code size, how speculative the code is (how often we have to throw it away), memory consumption of the result, security concerns, privacy concerns, and many more. And all that while we only have heuristics about those values (how do you guess them for a guest language like JavaScript, which the compiler (GraalVM, Java) does not even understand?). We are doing our best here; 60 years of compiler research and 20+ years of Java compiler research go into this. Our current focal point is warmup performance.
We should also point out that this is maybe a duplicate of, or the same as,
Sorry @wirthi, the file was under that comment:
^ this is the main.zip file, which is a React-Native packed application.
And to re-iterate: this "main.zip", which is a real application (based on an open-source tutorial, just web-packed), gives, if I understand right, 30.466 seconds (let's say 30.5 seconds - this is the best time per iteration) vs 2 seconds (it is 1.8, but I allow for error margins, CPU turbo, whatever). (The 30.5 seconds are against

So we talk not of a 2.5X slow-down but of a 15X slowdown. My client app slows down by around 20X, but even 15X is a world apart from 2.5X. My assumption is that the Mandreel benchmark does not have parsing code that goes deep on the stack, so the behavior around collections and the cost of heavily recursive calls is different. I did not look into GraalJS, but I would assume there is a cost related to "closure" copying, or something of that sort. Or maybe there is a structure that stresses your optimizer the wrong way; this is just a guess. To be clear, the application "main.zip" is a real-world application (though it does not behave as badly as the client application I was working on); it is basically a medium-sized React-Native tutorial that is bundled, and I share the bundle. Also, I logged separately, and it looks like on traversal of the tree GraalJS is under 2X slower (than NodeJS); it is the parsing code that is much slower - more than 15X. I will write it as a raw link here, in case the main.zip is missed:
As a side note (not sure why this is): our company works with a C++ implementation of a JS parser, and before being optimized, that code took around 7.5 seconds for that particular file (generating the same AST tree) in a release build. Some simple optimizations (around tokenization and some type specialization, i.e. "when strings are of length 1, we pass a char type instead of the full string") improved it to around 6.3 seconds, so at least for C++ the code is somewhat memory-sensitive. I am saying that maybe some optimizations based on type information, or just "strings with 1 char are passed as 1 char", could probably speed up the code; or maybe how some collections are iterated (i.e. optimize "indexOf" when the array is of strings, by creating an internal map, probably lazily) could realistically speed up the code and improve other similar codebases.
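To illustrate that indexOf idea at the JavaScript level (the actual suggestion concerns engine internals, so this is only a sketch of the intended semantics; `makeIndexer` is a hypothetical helper):

```js
// Answers indexOf() queries over a fixed string array. The Map is built only
// on the first lookup ("probably lazily"); afterwards each query is O(1)
// instead of a linear scan.
function makeIndexer(strings) {
  let index = null;
  return function indexOfString(s) {
    if (index === null) {
      index = new Map();
      for (let i = 0; i < strings.length; i++) {
        if (!index.has(strings[i])) index.set(strings[i], i); // keep first occurrence
      }
    }
    const pos = index.get(s);
    return pos === undefined ? -1 : pos; // mirror Array.prototype.indexOf
  };
}

// const lookup = makeIndexer(['var', 'let', 'const']);
// lookup('const'); // 2 - first call builds the map; later calls reuse it
```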
Hi, I'll try to run the code on Monday and address it in more detail, but concerning this:
This clearly is warmup performance. As we are a Java-based just-in-time compiler, YES, our warmup performance IS worse than that of engines using other techniques. The 2.5x above is PEAK performance AFTER WARMUP. Our engine is by design optimized for the peak performance of long-running server applications, not short-running command-line apps. We are trying to improve that as well; it is currently our main focus, but right now that's not where we shine. -- Christian
Thank you, but I am going to bet that it was not warmup:
It is clear that by around iteration 4/5 the code is warmed up, so the slow-down is somewhere else. I am not sure that you don't see the same behavior with the Mandreel file; it may be a slow-down specific to samples like bundled React-Native apps. My workplace sample is a customer's app I cannot share (because I cannot share the IP, even though this IP is somewhat public and extractable), which shows a slightly worse slow-down (around 20X, not 15X), but I assume the reason is the same, just exacerbated, as the customer app has a deep AST tree to parse.
Hello. First of all, I don't want this to come across as bashing GraalVM/JS, and this experiment is not meant to suggest I don't appreciate the project as it is; I know it is really hard work to build a full JS/NodeJS runtime.
So let's dig in: I want to parse large JS files using AcornJS and write a visitor that replaces some constructs in those files.
So imagine (or generate with webpack) a large file, and parse it as follows:
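The original snippet was not preserved in this copy of the issue; this is a minimal sketch of the described workload, assuming the acorn package, with deduplicateTypes as a placeholder for the real visitor (the file name is illustrative):

```js
const fs = require('fs');
const acorn = require('acorn');

// Placeholder for the visitor discussed below; the real pass inspects
// and replaces constructs in the tree.
function deduplicateTypes(ast) {
  // walk the AST here
}

const source = fs.readFileSync('generated-bundle.js', 'utf8'); // ~12 MB of JS

const start = Date.now();
const ast = acorn.parse(source, { ecmaVersion: 2020 });
deduplicateTypes(ast);
console.log(`parse + visit: ${Date.now() - start} ms`);
```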
With package.json file:
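(The file itself is not preserved here; a minimal sketch of what it plausibly contained, with the package name and versions as assumptions:)

```json
{
  "name": "acorn-parse-test",
  "version": "1.0.0",
  "private": true,
  "dependencies": {
    "acorn": "^8.0.0"
  }
}
```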
In the real workload, instead of "deduplicateTypes" there would be a small type checker that replaces some constructs (e.g. strings with some encrypted-string routine).
This code runs in around 1,300 ms using NodeJS, while with GraalJS it runs in around 105,000 ms. I kept iterating the program to give GraalVM time to get down to, let's say, 10 seconds, but over up to 20 iterations it never got under 105 seconds (the first iteration was 141,519 ms). The input file, which sadly I cannot share, is 12 MB of generated JS code from a ReactNative application.