
Serial and OMP optimization of EwaldRef #2176

Merged
prckent merged 12 commits into QMCPACK:develop from ewald_ref_opt on Jan 2, 2020

Conversation

jtkrogel
Contributor

I took a stab at optimizing the performance of EwaldRef by focusing on serial optimization prior to adding threading.

As noted in #2169, the k-space part dominated the computational load in the original implementation. By considering the asymptotic decay of the complementary error function (erfc(x) ~ exp(-x^2)/(x sqrt(pi))), one can write down a condition that gives the real- and k-space parts the same asymptotic exponent. This balancing condition leads to kappa = 1.
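For reference, this is the flavor of the balancing argument in the textbook Ewald convention with screening parameter alpha (EwaldRef's internal kappa convention may differ, so the constant is only indicative):

```latex
% Asymptotic decay of the two halves of the Ewald sum (standard convention):
%   real space:  erfc(\alpha r)/r        ~  e^{-\alpha^2 r^2} / (\sqrt{\pi}\,\alpha r^2)
%   k space:     e^{-k^2/(4\alpha^2)}/k^2
% Evaluating both at the n-th shell, r \sim nL and k \sim 2\pi n/L, and
% equating the exponents:
\begin{align}
  \alpha^2 (nL)^2 = \frac{(2\pi n/L)^2}{4\alpha^2}
  \quad\Longrightarrow\quad
  \alpha^2 L^2 = \pi
  \quad\Longrightarrow\quad
  \alpha = \frac{\sqrt{\pi}}{L}.
\end{align}
```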

Balancing the exponents results in the following speedup for the 32 atom NiO cell (run in serial on laptop):

| Exponents | Time |
|---|---|
| Original | 81.1845 s |
| Balanced | 0.3007 s |

For the 128 atom cell, EwaldRef now takes about 8 seconds on a single core:

| Size (atoms) | Time |
|---|---|
| 32 | 0.3007 s |
| 64 | 1.4828 s |
| 128 | 7.4832 s |

I next added OMP threading by first flattening the triangular charge-product and pair-displacement data into flat arrays (a temporary O(N^2) memory cost) and then performing a single OMP reduction over the resulting data-parallel loop.
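A minimal sketch of this flatten-then-reduce pattern (illustrative names, not the exact EwaldRef code; `pairPotential` stands in for the full real- plus k-space pair sum):

```cpp
// Sketch of the flatten-then-reduce pattern.  Names (Vec3, pairPotential,
// ewaldEnergySketch) are illustrative, not the actual EwaldRef interfaces.
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Stand-in per-pair kernel: nearest-image real-space term only, kappa = 1.
inline double pairPotential(const Vec3& dr, double qq)
{
  const double r = std::sqrt(dr.x * dr.x + dr.y * dr.y + dr.z * dr.z);
  return qq * std::erfc(r) / r;
}

double ewaldEnergySketch(const std::vector<Vec3>& R, const std::vector<double>& Q)
{
  const std::size_t N = R.size();

  // 1) Flatten the j < i triangle into linear arrays (temporary O(N^2) memory).
  std::vector<double> qq;
  std::vector<Vec3>   rr;
  qq.reserve(N * (N - 1) / 2);
  rr.reserve(N * (N - 1) / 2);
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < i; ++j)
    {
      qq.push_back(Q[i] * Q[j]);
      rr.push_back({R[i].x - R[j].x, R[i].y - R[j].y, R[i].z - R[j].z});
    }

  // 2) Single data-parallel loop with one OpenMP reduction.
  const long long Npair = static_cast<long long>(qq.size());
  double energy = 0.0;
#pragma omp parallel for reduction(+ : energy)
  for (long long p = 0; p < Npair; ++p)
    energy += pairPotential(rr[p], qq[p]);

  return energy;
}
```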

For the 128 atom problem, this resulted in the following times for each level of threading (note: there are 12 cores on the laptop):

| Threads | Time |
|---|---|
| 1 | 7.6578 s |
| 2 | 3.8986 s |
| 4 | 2.1098 s |
| 8 | 1.5173 s |
| 16 | 1.2456 s |

A few final adjustments are needed following discussion prior to merge.

@qmc-robot

Can one of the admins verify this patch?

@prckent
Contributor

prckent commented Dec 30, 2019

Terrific speedup.

@jtkrogel, @ye-luo Given #2172, what do you suggest? Can the optimized kappa choice be added to Ye's recently merged code, or is the code in this PR preferred?

@jtkrogel
Contributor Author

It could, though as a reference implementation the comparative simplicity of the code in this PR is an advantage, and a roughly 2 second cost is probably good enough.

Also, the prior threading optimizations targeted the k-space part, which should now be roughly balanced with the real-space portion.

@markdewing
Contributor

okay to test

@prckent
Contributor

prckent commented Dec 30, 2019

My 2 cents: the speed of the improved method and code in this PR is certainly good enough and the simplicity is extremely appealing. Probably MPI is no longer needed even for our unthreaded tests since the code is fast enough.

Thanks also for adding the timer.

@ye-luo
Contributor

ye-luo commented Dec 31, 2019

I followed @jtkrogel's direction and found the balance between the R and K parts by making the exp() and erfc() terms decay equally at the same, sufficiently large, index shell (ix, iy, iz).
For the MadelungSum (Eq. 7), the optimal choice is sqrt(pi)/Radius.
For the EwaldSum (Eq. 6), the optimal choice is Radius/sqrt(2 pi).
The radius is approximated as that of a sphere with the same volume as the cell.
The algorithmic cost of EwaldSum is N_atom^2 times that of MadelungSum. The original code picked a choice that is good for MadelungSum but terrible for EwaldSum in large cells.
I could not reproduce kappa = 1.0 as @jtkrogel did, but my choice gives another factor of 2 speedup. After connecting the tiling code, 16 threads on a Xeon give 0.0457 seconds for the 128 atom cell:

      EwaldSumOpt                      0.0456     0.0237              1       0.045625925
        Kspace                         0.0010     0.0010              1       0.000994921
        Rspace                         0.0209     0.0209              1       0.020933867

About half of the EwaldSumOpt time is not captured by the Kspace and Rspace timers, due to threading load imbalance.
The time difference between the K and R parts reflects the fact that I only optimized the K part.
Because the R and K parts are now balanced, I'd like to also optimize the R part (in a separate PR) to simplify the current code paths, although the performance improvement should be minor.
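For concreteness, a minimal sketch of the volume-derived choices described above (illustrative helper names, not the actual EwaldRef functions):

```cpp
// Sketch of the volume-derived kappa choices quoted above; the exact
// conventions are those of Eq. 6 and Eq. 7 in the source.
#include <cmath>

constexpr double pi = 3.141592653589793;

// Radius of the sphere whose volume equals the cell volume.
inline double sphereRadius(double volume)
{
  return std::cbrt(3.0 * volume / (4.0 * pi));
}

// Balanced choice for the Madelung (single-charge image) sum.
inline double kappaMadelung(double volume)
{
  return std::sqrt(pi) / sphereRadius(volume);
}

// Balanced choice for the pairwise Ewald sum.
inline double kappaEwald(double volume)
{
  return sphereRadius(volume) / std::sqrt(2.0 * pi);
}
```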

@prckent
Contributor

prckent commented Dec 31, 2019

@jtkrogel Are you OK with this change? Given the sparse coverage of reviewers and approvers at this time of year I will merge if you also approve.

@ye-luo Before any more code optimization or restructuring is done here, @jtkrogel will investigate better algorithms to cope with anisotropic cells. We discussed this in person yesterday and thought that choosing the cutoff on a per-axis basis might be a better approach; a spherical cutoff does not make sense for these cells. This is straightforward to investigate with simple code and can be done after this PR is in. (It is worth investigating, since the knowledge will be useful for improving the production Coulomb evaluation code.) The code in this PR is already good enough for the release in my opinion and will close #2158 once the timings from the nightlies bear out what is stated above.

@ye-luo
Contributor

ye-luo commented Dec 31, 2019

I also tried the 512 atom cell: 20 s with kappa = 1.0, and 0.7 s after my change.
The tiling optimization code path is separate from the base code path.
@jtkrogel, the `if (false)` code path in the function `ewaldEnergy` is the base code path and can be used to explore other schemes without interference from my optimization.

@jtkrogel
Contributor Author

Sorry to be slow to reply. I was looking more into the optimal kappa yesterday and came to basically the same conclusion as @ye-luo.

Here are updated results of the simply parallelized code (single omp parallel for):

| Size (atoms) | kappa | threads=1 | threads=16 |
|---|---|---|---|
| 128 | 1.0 | 7.6578 s | 1.2456 s |
| 128 | opt | 0.3805 s | 0.0638 s |
| 256 | opt | 1.4934 s | 0.2363 s |
| 512 | opt | 7.1191 s | 1.1219 s |

I get 0.06 s for 128 and 1.12 s for 512 w/ 16 threads on my mobile Xeon with the simply parallelized code. This compares with 0.05 s for 128 and 0.7 s for 512 that Ye finds with the more complex code on his (non-mobile?) Xeon.

The simple code is as fast as the more complex code, and any further optimization below 1 second is not meaningful here.

I strongly prefer using the simpler code as the reference. If all agree, I will make a new PR with the simplified code plus the optimal kappa.

@prckent
Contributor

prckent commented Dec 31, 2019

@jtkrogel Please make the simple PR. Simple code wins here.

@jtkrogel
Contributor Author

OK, I will do this on Thursday.

@jtkrogel jtkrogel closed this Dec 31, 2019
@anbenali
Contributor

anbenali commented Dec 31, 2019 via email

@ye-luo
Contributor

ye-luo commented Dec 31, 2019

The change in the current PR is simple and we can use it right away. I prefer to merge it as it is. The optimized code has already been merged into develop. @jtkrogel, I don't understand what you mean by making a simple PR. If you prefer using the simple code path, we can just switch the default. The optimized code gains its speedup from vectorization and data reuse, which a simple threading scheme cannot achieve; I saw a factor of 2-4 gain. A large asymmetric cell can be much worse than the NiO we tested. The optimized code path has some lengthy code which I can handle later.

@ye-luo ye-luo reopened this Dec 31, 2019
@prckent prckent self-requested a review January 1, 2020 00:26
@ye-luo
Contributor

ye-luo commented Jan 1, 2020

By the way, in the initial commits, flattening the triangular loop needs extra staging memory for qq and rr. The simple code path under `if (false)` uses a plain parallel for without extra memory.
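For comparison, a minimal sketch of the no-staging path (reusing the `Vec3`/`pairPotential` stand-ins from the sketch earlier in the thread; not the exact `if (false)` code):

```cpp
// No-staging alternative: parallelize the outer atom loop directly and
// reduce, so no temporary qq/rr arrays are needed.
double ewaldEnergyNoStaging(const std::vector<Vec3>& R, const std::vector<double>& Q)
{
  const long long N = static_cast<long long>(R.size());
  double energy = 0.0;
  // dynamic scheduling helps with the i-dependent inner loop length
#pragma omp parallel for reduction(+ : energy) schedule(dynamic)
  for (long long i = 1; i < N; ++i)
    for (long long j = 0; j < i; ++j)
    {
      const Vec3 dr{R[i].x - R[j].x, R[i].y - R[j].y, R[i].z - R[j].z};
      energy += pairPotential(dr, Q[i] * Q[j]);
    }
  return energy;
}
```

A `schedule(dynamic)` clause is one simple way to mitigate the load imbalance of the triangular loop.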

Contributor

@prckent prckent left a comment

My benevolent dictator/principal investigator decision is that for the reference Ewald code we will go with the simple source code. This is trivially understandable by non-ninjas and fast enough. The tiled code is harder to understand.

There will be plenty of time to demonstrate our optimization chops with the actual Coulomb code used in production - please all recall that this is the real problem here. This PR deals with a validation check that is run once, not thousands to millions of times. After @rcclay has implemented a switch to the other Ewald code, we still need improved understanding of how to reduce the numerical work and improve the convergence in anisotropic cells while maintaining good convergence in more isotropic (bulklike) cells. After that we can optimize to our hearts' content - hopefully without obfuscating the code too much :-)

The main consequence of the Coulomb potential bug that I am happy about is our switch to convergence parameters that are physical quantities users care about (energy tolerances) vs more abstract quantities such as basis set size. This needs to be propagated across the application.

@ye-luo
Contributor

ye-luo commented Jan 1, 2020

@prckent Fully agreed. Let me delete the tiling optimization directly here. The reason I prefer merging it first and deleting it afterwards is that the optimized code is saved in the history, so we can dig it out when it is really needed.

@ye-luo
Contributor

ye-luo commented Jan 1, 2020

@prckent The code is now as light as it was before the optimized code was added.
@jtkrogel Here are timings for the current simple code path on my Xeon Gold 6130:

| Size (atoms) | kappa | threads=1 | threads=16 |
|---|---|---|---|
| 128 | opt | 0.4810 s | 0.0472 s |
| 512 | opt | 7.4871 s | 0.7436 s |

The compiler cannot vectorize the simple code path well and my Xeon has a low clock speed, so my single-core run is slower. Since my Xeon has 16 real cores, my threaded runs are much faster.
This performance is good enough for our current use.

@prckent
Contributor

prckent commented Jan 2, 2020

Cleaned version looks good to me. Preserving the tiled code in the git history is a good idea (i.e. using the branch in this PR). @jtkrogel If you indicate that you are OK with the cleaned code I'll approve and merge.

@jtkrogel
Contributor Author

jtkrogel commented Jan 2, 2020

I added some comments and restored the simpler threading. This is just as fast and more transparent.

There is a temporary memory cost to flattening the triangular loop, but even for a 1024 atom system it amounts to about 2M real values, or roughly 16 MB. If this code is ever adapted for production use, these temporary arrays can be eliminated by using the existing distance tables.
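The arithmetic behind that estimate, assuming double precision and one charge product plus a 3-component displacement stored per pair:

```latex
N_{\text{pair}} = \frac{1024 \cdot 1023}{2} \approx 5.2\times10^{5},\qquad
5.2\times10^{5} \times 4\ \text{reals} \times 8\ \text{bytes}
\approx 1.7\times10^{7}\ \text{bytes} \approx 16.8\ \text{MB}.
```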

Ready for merge. I will next consider the anisotropy intrinsic to the quasi-2D vdW systems.

@prckent
Contributor

prckent commented Jan 2, 2020

Test this please

@prckent prckent merged commit 4284b05 into QMCPACK:develop Jan 2, 2020
@ye-luo
Contributor

ye-luo commented Jan 6, 2020

Leaving a few notes on the understanding gained from my last optimization attempt, which is preserved in the history but no longer visible in the code.
The k-space term is expensive because of the sincos evaluation in exp(i G.(Ri - Rj)); a direct computation costs Ng x Natom x Natom. We can separate this term into exp(i G.Ri) x exp(-i G.Rj), and the cost reduces to 2 x Ng x Natom (a sketch of this factorized form is given after the list below).
The r-space term is expensive in erfc(|Rs - (Ri - Rj)|). It costs Nr x Natom x Natom. I don't have a way to reduce its scaling.
The kappa choice we have in the current code assumes the same scaling for both the K- and R-space computations. I saw roughly the same time spent in both parts.
To have a fully balanced K- and R-space computation, we also need to take into account

  1. the cost of a single call to sincos() versus erfc(), and
  2. the difference in scaling between the K- and R-space computations if we use the factorized form to reduce the work.
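For the record, a minimal sketch of that factorization via the structure factor (illustrative names, not the removed tiling code):

```cpp
// The naive double sum  sum_G f(G) * sum_{i,j} q_i q_j cos(G.(R_i - R_j))
// costs Ng x Natom x Natom.  Computing the structure factor
//     S(G) = sum_i q_i exp(i G.R_i)
// first makes each G cost O(Natom), since the pair sum equals
// (|S(G)|^2 - sum_i q_i^2) / 2.
#include <cmath>
#include <complex>
#include <vector>

struct Vec3 { double x, y, z; };

double kspaceEnergyFactorized(const std::vector<Vec3>& G,
                              const std::vector<double>& fG,  // per-G prefactors, e.g. exp(-G^2/(4 kappa^2))/G^2
                              const std::vector<Vec3>& R,
                              const std::vector<double>& Q)
{
  double selfTerm = 0.0;
  for (double q : Q)
    selfTerm += q * q;

  double energy = 0.0;
  for (std::size_t g = 0; g < G.size(); ++g)
  {
    std::complex<double> S(0.0, 0.0);  // structure factor S(G)
    for (std::size_t i = 0; i < R.size(); ++i)
    {
      const double phase = G[g].x * R[i].x + G[g].y * R[i].y + G[g].z * R[i].z;
      S += Q[i] * std::complex<double>(std::cos(phase), std::sin(phase));
    }
    // |S(G)|^2 contains the i == j self terms; subtract them and halve to
    // recover the distinct-pair (i < j) sum.
    energy += fG[g] * 0.5 * (std::norm(S) - selfTerm);
  }
  return energy;
}
```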

@jtkrogel jtkrogel deleted the ewald_ref_opt branch March 22, 2021 13:30