
Serial and OMP optimization of EwaldRef #2176

Merged
prckent merged 12 commits into QMCPACK:develop from ewald_ref_opt on Jan 2, 2020

Conversation

jtkrogel
Contributor

I took a stab at optimizing the performance of EwaldRef by focusing on serial optimization prior to adding threading.

As noted in #2169, the k-space part dominated the computational load in the original implementation. By considering the asymptotic decay of the complementary error function (erfc(x) ~ exp(-x^2)/(x sqrt(pi))), one can write down a condition that gives the real- and k-space parts the same asymptotic exponent. This balancing condition leads to kappa = 1.
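For reference, this is the flavor of the balancing argument in the textbook Ewald convention with screening parameter alpha (EwaldRef's internal kappa convention may differ, so the constant is only indicative):

```latex
% Asymptotic decay of the two halves of the Ewald sum (standard convention):
%   real space:  erfc(\alpha r)/r        ~  e^{-\alpha^2 r^2} / (\sqrt{\pi}\,\alpha r^2)
%   k space:     e^{-k^2/(4\alpha^2)}/k^2
% Evaluating both at the n-th shell, r \sim nL and k \sim 2\pi n/L, and
% equating the exponents:
\begin{align}
  \alpha^2 (nL)^2 = \frac{(2\pi n/L)^2}{4\alpha^2}
  \quad\Longrightarrow\quad
  \alpha^2 L^2 = \pi
  \quad\Longrightarrow\quad
  \alpha = \frac{\sqrt{\pi}}{L}.
\end{align}
```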

Balancing the exponents results in the following speedup for the 32 atom NiO cell (run in serial on laptop):

| Exponents | Time |
|---|---|
| Original | 81.1845 s |
| Balanced | 0.3007 s |

For the 128 atom cell, EwaldRef now takes about 8 seconds on a single core:

| Size (atoms) | Time |
|---|---|
| 32 | 0.3007 s |
| 64 | 1.4828 s |
| 128 | 7.4832 s |

I next added OMP threading by first flattening the triangular charge-product and pair-displacement data into flat arrays (a temporary O(N^2) memory cost) and then performing a single OMP reduction over the resulting data-parallel loop.
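A minimal sketch of this flatten-then-reduce pattern (illustrative names, not the exact EwaldRef code; `pairPotential` stands in for the full real- plus k-space pair sum):

```cpp
// Sketch of the flatten-then-reduce pattern.  Names (Vec3, pairPotential,
// ewaldEnergySketch) are illustrative, not the actual EwaldRef interfaces.
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Stand-in per-pair kernel: nearest-image real-space term only, kappa = 1.
inline double pairPotential(const Vec3& dr, double qq)
{
  const double r = std::sqrt(dr.x * dr.x + dr.y * dr.y + dr.z * dr.z);
  return qq * std::erfc(r) / r;
}

double ewaldEnergySketch(const std::vector<Vec3>& R, const std::vector<double>& Q)
{
  const std::size_t N = R.size();

  // 1) Flatten the j < i triangle into linear arrays (temporary O(N^2) memory).
  std::vector<double> qq;
  std::vector<Vec3>   rr;
  qq.reserve(N * (N - 1) / 2);
  rr.reserve(N * (N - 1) / 2);
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < i; ++j)
    {
      qq.push_back(Q[i] * Q[j]);
      rr.push_back({R[i].x - R[j].x, R[i].y - R[j].y, R[i].z - R[j].z});
    }

  // 2) Single data-parallel loop with one OpenMP reduction.
  const long long Npair = static_cast<long long>(qq.size());
  double energy = 0.0;
#pragma omp parallel for reduction(+ : energy)
  for (long long p = 0; p < Npair; ++p)
    energy += pairPotential(rr[p], qq[p]);

  return energy;
}
```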

For the 128 atom problem, this resulted in the following times for each level of threading (note: there are 12 cores on the laptop):

| Threads | Time |
|---|---|
| 1 | 7.6578 s |
| 2 | 3.8986 s |
| 4 | 2.1098 s |
| 8 | 1.5173 s |
| 16 | 1.2456 s |

A few final adjustments are needed following discussion prior to merge.

@qmc-robot

Can one of the admins verify this patch?

@prckent
Contributor

prckent commented Dec 30, 2019

Terrific speedup.

@jtkrogel, @ye-luo Given #2172, what do you suggest? Can the optimized kappa choice be added to Ye's recently merged code, or is the code in this PR preferred?

@jtkrogel
Contributor Author

It could, though as a reference implementation the comparative simplicity of the code in this PR is an advantage, and a roughly 2 second cost is probably good enough.

Also, the prior threading optimizations targeted the k-space part, which should now be roughly balanced with the real-space portion.

@markdewing
Contributor

okay to test

@prckent
Contributor

prckent commented Dec 30, 2019

My 2 cents: the speed of the improved method and code in this PR is certainly good enough and the simplicity is extremely appealing. Probably MPI is no longer needed even for our unthreaded tests since the code is fast enough.

Thanks also for adding the timer.

@ye-luo
Contributor

ye-luo commented Dec 31, 2019

I followed @jtkrogel's direction and found the balance between the R and K parts by making the exp() and erfc() terms decay equally at the same, sufficiently large, index shell (ix, iy, iz).
For the MadelungSum (Eq. 7), the optimal choice is sqrt(pi)/Radius.
For the EwaldSum (Eq. 6), the optimal choice is Radius/sqrt(2 pi).
The radius is approximated as that of a sphere with the same volume as the cell.
The algorithmic cost of EwaldSum is N_atom^2 times that of MadelungSum. The original code picked a choice that is good for MadelungSum but terrible for EwaldSum in large cells.
I could not reproduce kappa = 1.0 as @jtkrogel did, but my choice gives another factor of 2 speedup. After connecting the tiling code, 16 threads on a Xeon give 0.0457 seconds for the 128 atom cell:

      EwaldSumOpt                      0.0456     0.0237              1       0.045625925
        Kspace                         0.0010     0.0010              1       0.000994921
        Rspace                         0.0209     0.0209              1       0.020933867

About half of the EwaldSumOpt time is not captured by the Kspace and Rspace timers, due to threading load imbalance.
The time difference between the K and R parts reflects the fact that I only optimized the K part.
Because the R and K parts are now balanced, I'd like to also optimize the R part (in a separate PR) to simplify the current code paths, although the performance improvement should be minor.
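For concreteness, a minimal sketch of the volume-derived choices described above (illustrative helper names, not the actual EwaldRef functions):

```cpp
// Sketch of the volume-derived kappa choices quoted above; the exact
// conventions are those of Eq. 6 and Eq. 7 in the source.
#include <cmath>

constexpr double pi = 3.141592653589793;

// Radius of the sphere whose volume equals the cell volume.
inline double sphereRadius(double volume)
{
  return std::cbrt(3.0 * volume / (4.0 * pi));
}

// Balanced choice for the Madelung (single-charge image) sum.
inline double kappaMadelung(double volume)
{
  return std::sqrt(pi) / sphereRadius(volume);
}

// Balanced choice for the pairwise Ewald sum.
inline double kappaEwald(double volume)
{
  return sphereRadius(volume) / std::sqrt(2.0 * pi);
}
```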

@prckent
Contributor

prckent commented Dec 31, 2019

@jtkrogel Are you OK with this change? Given the sparse coverage of reviewers and approvers at this time of year I will merge if you also approve.

@ye-luo Before any more code optimization or restructuring is done here, @jtkrogel will investigate better algorithms to cope with anisotropic cells. We discussed this in person yesterday and thought that choosing the cutoff on a per-axis basis might be a better approach; a spherical cutoff does not make sense for these cells. This is straightforward to investigate with simple code and can be done after this PR is in. (It is worth investigating, since the knowledge will be useful for improving the production Coulomb evaluation code.) The code in this PR is already good enough for the release in my opinion and will close #2158 once the timings from the nightlies bear out what is stated above.

@ye-luo
Contributor

ye-luo commented Dec 31, 2019

I also tried the 512 atom cell: 20 s with kappa = 1.0, and 0.7 s after my change.
The tiling optimization code path is separate from the base code path.
@jtkrogel, the `if (false)` code path in the function `ewaldEnergy` is the base code path and can be used to explore other schemes without interference from my optimization.

@jtkrogel
Contributor Author

Sorry to be slow to reply. I was looking more into the optimal kappa yesterday and came to basically the same conclusion as @ye-luo.

Here are updated results of the simply parallelized code (single omp parallel for):

| Size (atoms) | kappa | threads=1 | threads=16 |
|---|---|---|---|
| 128 | 1.0 | 7.6578 s | 1.2456 s |
| 128 | opt | 0.3805 s | 0.0638 s |
| 256 | opt | 1.4934 s | 0.2363 s |
| 512 | opt | 7.1191 s | 1.1219 s |

I get 0.06 s for 128 and 1.12 s for 512 w/ 16 threads on my mobile Xeon with the simply parallelized code. This compares with 0.05 s for 128 and 0.7 s for 512 that Ye finds with the more complex code on his (non-mobile?) Xeon.

The simple code is as fast as the more complex code, and any further optimization below 1 second is not meaningful here.

I strongly prefer using the simpler code as the reference. If all agree, I will make a new PR with the simplified code plus the optimal kappa.

@prckent
Contributor

prckent commented Dec 31, 2019

@jtkrogel Please make the simple PR. Simple code wins here.

@jtkrogel
Contributor Author

OK, I will do this on Thursday.

@jtkrogel jtkrogel closed this Dec 31, 2019
@anbenali
Contributor

anbenali commented Dec 31, 2019 via email

@ye-luo
Contributor

ye-luo commented Dec 31, 2019

The change in the current PR is simple and we can use it right away. I prefer to merge it as it is. The optimized code has already been merged into develop. @jtkrogel, I don't understand what you mean by making a simple PR. If you prefer using the simple code path, we can just switch the default. The optimized code gains its speedup from vectorization and data reuse, which a simple threading scheme cannot achieve; I saw a factor of 2-4 gain. A large asymmetric cell can be much worse than the NiO we tested. The optimized code path has some lengthy code which I can handle later.

@ye-luo ye-luo reopened this Dec 31, 2019
@prckent prckent self-requested a review January 1, 2020 00:26
@ye-luo
Contributor

ye-luo commented Jan 1, 2020

By the way, in the initial commits, flattening the triangular loop needs extra staging memory for qq and rr. The simple code path under `if (false)` uses a plain parallel for without extra memory.
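For comparison, a minimal sketch of the no-staging path (reusing the `Vec3`/`pairPotential` stand-ins from the sketch earlier in the thread; not the exact `if (false)` code):

```cpp
// No-staging alternative: parallelize the outer atom loop directly and
// reduce, so no temporary qq/rr arrays are needed.
double ewaldEnergyNoStaging(const std::vector<Vec3>& R, const std::vector<double>& Q)
{
  const long long N = static_cast<long long>(R.size());
  double energy = 0.0;
  // dynamic scheduling helps with the i-dependent inner loop length
#pragma omp parallel for reduction(+ : energy) schedule(dynamic)
  for (long long i = 1; i < N; ++i)
    for (long long j = 0; j < i; ++j)
    {
      const Vec3 dr{R[i].x - R[j].x, R[i].y - R[j].y, R[i].z - R[j].z};
      energy += pairPotential(dr, Q[i] * Q[j]);
    }
  return energy;
}
```

A `schedule(dynamic)` clause is one simple way to mitigate the load imbalance of the triangular loop.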

Contributor

@prckent prckent left a comment

My benevolent dictator/principal investigator decision is that for the reference Ewald code we will go with the simple source code. This is trivially understandable by non-ninjas and fast enough. The tiled code is harder to understand.

There will be plenty of time to demonstrate our optimization chops with the actual Coulomb code used in production - please all recall that this is the real problem here. This PR deals with a validation check that is run once, not thousands to millions of times. After @rcclay has implemented a switch to the other Ewald code, we still need improved understanding of how to reduce the numerical work and improve the convergence in anisotropic cells while maintaining good convergence in more isotropic (bulklike) cells. After that we can optimize to our hearts' content - hopefully without obfuscating the code too much :-)

The main consequence of the Coulomb potential bug that I am happy about is our switch to convergence parameters that are physical quantities users care about (energy tolerances) vs more abstract quantities such as basis set size. This needs to be propagated across the application.

@ye-luo
Contributor

ye-luo commented Jan 1, 2020

@prckent Fully agreed. Let me delete the tiling optimization directly here. The reason I prefer merging it first and deleting it afterwards is that the optimized code is saved in the history, so we can dig it out when it is really needed.

@ye-luo
Contributor

ye-luo commented Jan 1, 2020

@prckent The code is now as light as it was before the optimized code was added.
@jtkrogel Here are timings for the current simple code path on my Xeon Gold 6130:

| Size (atoms) | kappa | threads=1 | threads=16 |
|---|---|---|---|
| 128 | opt | 0.4810 s | 0.0472 s |
| 512 | opt | 7.4871 s | 0.7436 s |

The compiler cannot vectorize the simple code path well and my Xeon has a low clock speed, so my single-core run is slower. Since my Xeon has 16 real cores, my threaded runs are much faster.
This performance is good enough for our current use.

@prckent
Contributor

prckent commented Jan 2, 2020

Cleaned version looks good to me. Preserving the tiled code in the git history is a good idea (i.e. using the branch in this PR). @jtkrogel If you indicate that you are OK with the cleaned code I'll approve and merge.

@jtkrogel
Contributor Author

jtkrogel commented Jan 2, 2020

I added some comments and restored the simpler threading. This is just as fast and more transparent.

There is a temporary memory cost to flattening the triangular loop, but even for a 1024 atom system it amounts to about 2M real values, or roughly 16 MB. If this code is ever adapted for production use, these temporary arrays can be eliminated by using the existing distance tables.
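The arithmetic behind that estimate, assuming double precision and one charge product plus a 3-component displacement stored per pair:

```latex
N_{\text{pair}} = \frac{1024 \cdot 1023}{2} \approx 5.2\times10^{5},\qquad
5.2\times10^{5} \times 4\ \text{reals} \times 8\ \text{bytes}
\approx 1.7\times10^{7}\ \text{bytes} \approx 16.8\ \text{MB}.
```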

Ready for merge. I will next consider the anisotropy intrinsic to the quasi-2D vdW systems.

@prckent
Contributor

prckent commented Jan 2, 2020

Test this please

@prckent prckent merged commit 4284b05 into QMCPACK:develop Jan 2, 2020
@ye-luo
Contributor

ye-luo commented Jan 6, 2020

Leaving a few notes on the understanding gained from my last optimization attempt, which is preserved in the history but no longer visible in the code.
The k-space term is expensive because of the sincos evaluation in exp(i G.(Ri - Rj)); a direct computation costs Ng x Natom x Natom. We can separate this term into exp(i G.Ri) x exp(-i G.Rj), and the cost reduces to 2 x Ng x Natom (a sketch of this factorized form is given after the list below).
The r-space term is expensive in erfc(|Rs - (Ri - Rj)|). It costs Nr x Natom x Natom. I don't have a way to reduce its scaling.
The kappa choice we have in the current code assumes the same scaling for both the K- and R-space computations. I saw roughly the same time spent in both parts.
To have a fully balanced K- and R-space computation, we also need to take into account

  1. the cost of a single call to sincos() versus erfc(), and
  2. the difference in scaling between the K- and R-space computations if we use the factorized form to reduce the work.
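For the record, a minimal sketch of that factorization via the structure factor (illustrative names, not the removed tiling code):

```cpp
// The naive double sum  sum_G f(G) * sum_{i,j} q_i q_j cos(G.(R_i - R_j))
// costs Ng x Natom x Natom.  Computing the structure factor
//     S(G) = sum_i q_i exp(i G.R_i)
// first makes each G cost O(Natom), since the pair sum equals
// (|S(G)|^2 - sum_i q_i^2) / 2.
#include <cmath>
#include <complex>
#include <vector>

struct Vec3 { double x, y, z; };

double kspaceEnergyFactorized(const std::vector<Vec3>& G,
                              const std::vector<double>& fG,  // per-G prefactors, e.g. exp(-G^2/(4 kappa^2))/G^2
                              const std::vector<Vec3>& R,
                              const std::vector<double>& Q)
{
  double selfTerm = 0.0;
  for (double q : Q)
    selfTerm += q * q;

  double energy = 0.0;
  for (std::size_t g = 0; g < G.size(); ++g)
  {
    std::complex<double> S(0.0, 0.0);  // structure factor S(G)
    for (std::size_t i = 0; i < R.size(); ++i)
    {
      const double phase = G[g].x * R[i].x + G[g].y * R[i].y + G[g].z * R[i].z;
      S += Q[i] * std::complex<double>(std::cos(phase), std::sin(phase));
    }
    // |S(G)|^2 contains the i == j self terms; subtract them and halve to
    // recover the distinct-pair (i < j) sum.
    energy += fG[g] * 0.5 * (std::norm(S) - selfTerm);
  }
  return energy;
}
```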

@jtkrogel jtkrogel deleted the ewald_ref_opt branch March 22, 2021 13:30