Serial and OMP optimization of EwaldRef #2176
Conversation
Can one of the admins verify this patch? |
It could, though as a reference implementation the comparative simplicity of the code in this PR is an advantage, and a 2-second cost is probably good enough. Also, the prior threading optimizations targeted the k-space part, which should now be roughly balanced with the real-space portion. |
okay to test |
My 2 cents: the speed of the improved method and code in this PR is certainly good enough and the simplicity is extremely appealing. Probably MPI is no longer needed even for our unthreaded tests since the code is fast enough. Thanks also for adding the timer. |
I followed @jtkrogel's direction and found the balance between the R and K parts by making the exp() and erfc() terms balance at the same (sufficiently large) index shell (ix, iy, iz).
About half of the EwaldSumOpt timing is not captured by the K- and R-space timers, due to threading load imbalance. |
@jtkrogel Are you OK with this change? Given the sparse coverage of reviewers and approvers at this time of year I will merge if you also approve. @ye-luo Before any more code optimization or restructuring is done here, @jtkrogel will investigate better algorithms to cope with anisotropic cells. We discussed in person yesterday and thought that choosing the cutoff on a per axis basis might be a better approach. A spherical cutoff does not make sense for these cells. This is straightforward to investigate with simple code and can be done after this PR is in. (It is worth investigating since the knowledge will be useful for improving the production Coulomb evaluation code.) The code in this PR is already good enough for the release in my opinion and will close #2158 once the timings from the nightlies bear out what is stated above. |
I also tried the 512-atom cell: 20 s with kappa = 1.0, and 0.7 s after my change. |
Sorry to be slow on the reply. I was looking more into optimal kappa yesterday and I came to basically the same conclusion as @ye-luo for kappa. Here are updated results of the simply parallelized code (single omp parallel for):
I get 0.06 s for 128 atoms and 1.12 s for 512 atoms with 16 threads on my mobile Xeon with the simply parallelized code. This compares with the 0.05 s for 128 and 0.7 s for 512 that Ye finds with the more complex code on his (non-mobile?) Xeon. The simple code is as fast as the more complex code. Also, any further optimizations below 1 second are not meaningful here. I strongly prefer using simpler code as the reference. If all agree, I will make a new PR with the simplified code + optimal kappa. |
@jtkrogel Please make the simple PR. Simple code wins here. |
OK, I will do this on Thursday. |
Impressive!
…On Tue, Dec 31, 2019, 15:46 Jaron T. Krogel wrote:
Closed #2176.
|
The change in the current PR is simple and we can use it right away. I prefer to merge it as it is. The optimized code has already been merged into develop. @jtkrogel I don’t understand what you mean by making a simple PR. If you prefer using the simple code path, we can just switch the default. The optimized code gains its speedup from vectorization and data reuse, which a simple threading scheme cannot achieve; I saw a factor of 2-4 gain. A large asymmetric cell can be much worse than the NiO cells we tested. The optimized code path has some lengthy code, which I can handle later. |
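For whoever digs the tiled path out of the history later, here is a rough, hypothetical sketch of the data-reuse idea (not the actual EwaldRef code; the names, the TILE size, and the nearest-image-only pair term are all illustrative):

```c++
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

// Hypothetical sketch of a tiled pair sum: the block R[jb..jend) is reused
// against every partner i, improving cache/register reuse and giving the
// compiler a short, fixed-trip inner loop to vectorize.
double tiledPairSum(const std::vector<double>& q,
                    const std::vector<std::array<double, 3>>& R,
                    double kappa)
{
  constexpr int TILE = 8; // illustrative tile size
  const int N        = static_cast<int>(q.size());
  double E           = 0.0;
  for (int jb = 0; jb < N; jb += TILE) // block of "partner" particles
  {
    const int jend = std::min(jb + TILE, N);
    for (int i = jb + 1; i < N; ++i)             // i > j visits each pair once
      for (int j = jb; j < std::min(jend, i); ++j)
      {
        const double dx = R[i][0] - R[j][0];
        const double dy = R[i][1] - R[j][1];
        const double dz = R[i][2] - R[j][2];
        const double r  = std::sqrt(dx * dx + dy * dy + dz * dz);
        E += q[i] * q[j] * std::erfc(kappa * r) / r; // nearest-image term only
      }
  }
  return E;
}
```

Keeping the small j-block resident while sweeping over i is what provides the data reuse and the short, vectorizable inner loop that a plain flat loop does not.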
By the way, in the initial commits, flattening the triangular loop needs extra staging memory for qq and rr. The simple code path under
My benevolent dictator/principal investigator decision is that for the reference Ewald code we will go with the simple source code. This is trivially understandable by non-ninjas and fast enough. The tiled code is harder to understand.
There will be plenty of time to demonstrate our optimization chops with the actual Coulomb code used in production - please all recall that this is the real problem here. This PR deals with a validation check that is run once, not thousands or millions of times. After @rcclay has implemented a switch to the other Ewald code, we still need improved understanding of how to reduce the numerical work and improve the convergence in anisotropic cells while maintaining good convergence in more isotropic (bulklike) cells. After that we can optimize to our hearts' content - hopefully not obfuscating the code too much :-)
The main consequence of the Coulomb potential bug that I am happy about is our switch to convergence parameters that are physical quantities users care about (energy tolerances) vs more abstract quantities such as basis set size. This needs to be propagated across the application.
@prckent Fully agreed. Let me delete the tiling optimization directly in this PR. The reason I preferred merging it first and deleting it now is that it is saved in the history, so we can dig it out when it is really needed. |
@prckent The code is now as light as it was before the optimized code was added.
The compiler cannot vectorize the simple code path well, and my Xeon has a low clock speed, so my single-core run takes longer. Since my Xeon has 16 real cores, my threaded run is much faster. |
The cleaned version looks good to me. Preserving the tiled code in the git history is a good idea (i.e., via the branch in this PR). @jtkrogel If you indicate that you are OK with the cleaned code, I'll approve and merge. |
I added some comments and restored the simpler threading. This is just as fast and more transparent. There is a temporary memory cost to flattening the triangular loop, but even for a 1024-atom system it amounts to 2M real values, or about 16 MB. If this code is ever adapted for production use, these temporary arrays will be eliminated by using the distance tables. Ready for merge. I will next consider the anisotropy intrinsic to the quasi-2D vdW systems. |
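As a rough check of that estimate (assuming one real per pair for the charge product qq, three for the pair displacement rr, and 8-byte reals):

```math
N_\mathrm{pair} = \frac{1024 \cdot 1023}{2} \approx 5.2\times 10^{5}, \qquad
4\,N_\mathrm{pair} \approx 2.1\times 10^{6}\ \text{reals} \times 8\ \text{B} \approx 16.8\ \text{MB},
```

consistent with the ~2M reals / 16 MB quoted above.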
Test this please |
Leaving some notes here on my understanding from my last optimization attempt, which is in the history but no longer visible.
|
I took a stab at optimizing the performance of EwaldRef by focusing on serial optimization prior to adding threading.
As noted in #2169, the k-space part dominated the computational load in the original implementation. By considering the asymptotic decay of the error function (erfc(x)->exp(-x^2)/x), you can write down a condition that balances the asymptotic decay of the real and k-space parts (same exponent). This balancing condition leads to kappa=1.
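As a sketch of that balancing condition (assuming the standard Ewald splitting and a cubic cell of side L): the real-space terms at image shell n decay like exp(-kappa^2 n^2 L^2), while the k-space terms at the same shell (k = 2*pi*n/L) decay like exp(-k^2/(4*kappa^2)). Requiring equal exponents at the same shell gives

```math
\kappa^2 n^2 L^2 = \frac{\pi^2 n^2}{\kappa^2 L^2}
\quad\Longrightarrow\quad
\kappa = \frac{\sqrt{\pi}}{L},
```

which, in the reduced length units the reference implementation presumably works in, is the condition behind the kappa = 1 quoted above (the unit convention here is an assumption).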
Balancing the exponents results in the following speedup for the 32-atom NiO cell (run in serial on a laptop):
For the 128-atom cell, EwaldRef now takes about 8 seconds on a single core:
I next added OMP threading by first flattening the triangular charge and pair-distance data into serial arrays (a temporary N^2 memory cost) and then performing a single OMP reduction over the serialized, data-parallel flat loop.
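A minimal sketch of that pattern, under the assumption that the pair term reduces to a nearest-image erfc contribution (the actual EwaldRef also sums periodic images and the k-space part); all names here are illustrative, not the real interface:

```c++
#include <array>
#include <cmath>
#include <vector>

using Real = double;
using Vec3 = std::array<Real, 3>;

// Simplified stand-in for the per-pair real-space Ewald term:
// qq * erfc(kappa*r) / r for the nearest image only.
inline Real pairTerm(Real qq, const Vec3& r, Real kappa)
{
  const Real d = std::sqrt(r[0] * r[0] + r[1] * r[1] + r[2] * r[2]);
  return qq * std::erfc(kappa * d) / d;
}

Real flatPairSum(const std::vector<Real>& q, const std::vector<Vec3>& R, Real kappa)
{
  const size_t N = q.size();
  // Flatten the triangular i<j loop into serial arrays (temporary ~N^2/2 storage).
  std::vector<Real> qq;
  std::vector<Vec3> rr;
  qq.reserve(N * (N - 1) / 2);
  rr.reserve(N * (N - 1) / 2);
  for (size_t i = 0; i < N; ++i)
    for (size_t j = 0; j < i; ++j)
    {
      qq.push_back(q[i] * q[j]);
      rr.push_back({R[i][0] - R[j][0], R[i][1] - R[j][1], R[i][2] - R[j][2]});
    }
  // Single OMP reduction over the serialized, data-parallel flat loop.
  Real E                = 0.0;
  const long long Npair = static_cast<long long>(qq.size());
#pragma omp parallel for reduction(+ : E)
  for (long long p = 0; p < Npair; ++p)
    E += pairTerm(qq[p], rr[p], kappa);
  return E;
}
```

The flat loop has no triangular dependence, so a plain `omp parallel for` with a `+` reduction load-balances it directly.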
For the 128-atom problem, the OMP threading resulted in the following times at each level of threading (note: there are 12 cores on the laptop):
A few final adjustments are needed following discussion prior to merge.