The Wayback Machine - https://web.archive.org/web/20201011131329/https://github.com/dotnet/BenchmarkDotNet/pull/1543
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory diagnoser fix for Tiered Compilation #1543

Open
wants to merge 3 commits into
base: master
from

Conversation

@adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Oct 1, 2020

In #1542 @ronbrogan has reported a very unusual bug - a code that was CPU bound and not allocating at all was reporting allocations for .NET Core 3.1 (it works fine for 2.1 as there we use GC.GetAllocatedBytesForCurrentThread instead GC.GetTotalAllocatedBytes):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

namespace BenchmarkAllocationDebug
{
    class Program
    {
        static void Main() => BenchmarkRunner.Run<Bench>();
    }

    [KeepBenchmarkFiles]
    public class Bench
    {
        [Benchmark] public ulong Benchmark1() => TestMethod();
        [Benchmark] public ulong Benchmark2() => TestMethod();
        [Benchmark] public ulong Benchmark3() => TestMethod();
        [Benchmark] public ulong Benchmark4() => TestMethod();
        [Benchmark] public ulong Benchmark5() => TestMethod();

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static ulong TestMethod()
        {
            var r = 1ul;
            for (var i = 0; i < 50_000_000; i++)
            {
                r /= 1;
            }
            return r;
        }
    }
}

After some investigation, I've narrowed down the problem to Tiered JIT thread that from time to time would be promoting methods from Tier 0 to Tier 1 and allocating memory during the iteration where we call GC.GetTotalAllocatedBytes.

obraz

I had few ideas, but the only one worked was putting the thread to sleep for 250ms before we call GC.GetTotalAllocatedBytes. In this time TC thread kicks-in and promotes the methods. It's of course far from perfect as TC might not finish the promotion before we make the first call to GC.GetTotalAllocatedBytes. I don't want to prolong the sleeping period because it would increase the time we need to run the benchmarks.

@kouvel do you have a better idea of how we could prevent TC from working at a given point of time?

I've confirmed that it works as expected by modifying the GetExtraStats method to emit some extra events and filtering the TC events in PerfView to this particular period of time:

private (GcStats, ThreadingStats) GetExtraStats(IterationData data)
{
    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStart(data.IterationMode, data.IterationStage, -2);

    if (RuntimeInformation.IsNetCore)
    {
        // we put the current thread to sleep so Tiered Compiler can kick in, compile it's stuff
        // and NOT allocate anything on the background thread when we are measuring allocations
        System.Threading.Thread.Sleep(TimeSpan.FromMilliseconds(250));
    }

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStop(data.IterationMode, data.IterationStage, -2);

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStart(data.IterationMode, data.IterationStage, -1);

    // we enable monitoring after main target run, for this single iteration which is executed at the end
    // so even if we enable AppDomain monitoring in separate process
    // it does not matter, because we have already obtained the results!
    EnableMonitoring();

    IterationSetupAction(); // we run iteration setup first, so even if it allocates, it is not included in the results

    var initialThreadingStats = ThreadingStats.ReadInitial(); // this method might allocate
    var initialGcStats = GcStats.ReadInitial();

    WorkloadAction(data.InvokeCount / data.UnrollFactor);

    var finalGcStats = GcStats.ReadFinal();
    var finalThreadingStats = ThreadingStats.ReadFinal();

    IterationCleanupAction(); // we run iteration cleanup after collecting GC stats

    GcStats gcStats = (finalGcStats - initialGcStats).WithTotalOperations(data.InvokeCount * OperationsPerInvoke);
    ThreadingStats threadingStats = (finalThreadingStats - initialThreadingStats).WithTotalOperations(data.InvokeCount * OperationsPerInvoke);

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStop(data.IterationMode, data.IterationStage, -1);

    return (gcStats, threadingStats);
}

obraz

obraz

adamsitnik added 3 commits Oct 1, 2020
…iteration, the Tiered JIT might kick-in and allocate some memory and affect the results

as a workaround, we can put the thread to sleep for more than 200ms to TC thread kicks in before we start memory measurements

it's far from perfect but it works

fixes #1542
@adamsitnik adamsitnik added this to the v0.12.2 milestone Oct 1, 2020
@adamsitnik adamsitnik requested a review from AndreyAkinshin Oct 1, 2020
@kouvel
Copy link
Member

@kouvel kouvel commented Oct 1, 2020

Hmm you had mentioned on the other issue that 5.0 is free of this issue, is this really specific to 3.1? Timings have changed between 3.1 and 5.0 such that the rejits would happen sooner in 5.0 but it seems unlikely that that would affect microbenchmarks (the change is easily visible in real-world cases with many thousands of methods but there would only be a few methods here). For allocation measurements especially with long-running methods and with a strong assertion about allocation (that there shouldn't be any) generally my suggestion is to disable tiered compilation. Tier 0 jitted code may allocate where optimized code would not, and the former may allocate much more than the latter. An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1, which should tier-up more quickly, but it would still not guarantee that things would get tiered-up in time, if the involved methods are very long-running some methods may not reach tier 1.

Aside from that, I wonder what is actually causing allocation during jitting. There are some small allocation-dependent things I'm aware of regarding virtual methods that can be fixed, but otherwise on a rejit since all static construction, etc. should have already been done I'm not sure what would be allocating. Do you have an idea?

@kouvel
Copy link
Member

@kouvel kouvel commented Oct 1, 2020

Also I don't think there should be any guarantee that there would not be any allocation happening in the background. As we move more stuff to managed code like the thread pool, things may happen in the background unrelated to the app's code that may cause allocation and that should be ok.

@adamsitnik
Copy link
Member Author

@adamsitnik adamsitnik commented Oct 2, 2020

Hmm you had mentioned on the other issue that 5.0 is free of this issue, is this really specific to 3.1?

Yes, it looks like it's specific to 3.1 (I did not test 2.2 and 3.0 as they are not supported anymore)

BenchmarkDotNet=v0.12.1.20201002-develop, OS=Windows 10.0.18363.1082 (1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-WTPPRD : .NET 5.0.0 (5.0.20.45114), X64 RyuJIT
  Job-MUUDGX : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-TXYPKS : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  Job-IEOLEB : .NET Framework 4.8 (4.8.4220.0), X64 RyuJIT


|     Method |              Runtime | Allocated |
|----------- |--------------------- |----------:|
| Benchmark1 |             .NET 5.0 |         - |
| Benchmark1 |        .NET Core 2.1 |         - |
| Benchmark1 |        .NET Core 3.1 |       9 B |
| Benchmark1 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark2 |             .NET 5.0 |         - |
| Benchmark2 |        .NET Core 2.1 |         - |
| Benchmark2 |        .NET Core 3.1 |      20 B |
| Benchmark2 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark3 |             .NET 5.0 |         - |
| Benchmark3 |        .NET Core 2.1 |         - |
| Benchmark3 |        .NET Core 3.1 |      21 B |
| Benchmark3 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark4 |             .NET 5.0 |         - |
| Benchmark4 |        .NET Core 2.1 |         - |
| Benchmark4 |        .NET Core 3.1 |     178 B |
| Benchmark4 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark5 |             .NET 5.0 |         - |
| Benchmark5 |        .NET Core 2.1 |         - |
| Benchmark5 |        .NET Core 3.1 |     101 B |
| Benchmark5 |   .NET Framework 4.8 |         - |
@kouvel
Copy link
Member

@kouvel kouvel commented Oct 2, 2020

Ok I think it is the virtual slot backpatching storage. It's probably the rejit timing that is causing the allocation to happen earlier and not show up in the benchmark in 5.0.

@adamsitnik
Copy link
Member Author

@adamsitnik adamsitnik commented Oct 2, 2020

An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1

I am afraid that this could lead to BDN reporting "too perfect" results that could differ from what end-users with default settings are experiencing

Aside from that, I wonder what is actually causing allocation during jitting

I was also curious and tried to use VS Memory Profiler to find out, but I've failed.

In this particular case the VS Profiler shows me memory allocated for JITTing of the methods that are executed for the first time:

obraz

But it does not show me anything attributed to the TP Thread and TieredCompilationManager.OptimizeMethods

@adamsitnik
Copy link
Member Author

@adamsitnik adamsitnik commented Oct 2, 2020

Also I don't think there should be any guarantee that there would not be any allocation happening in the background.

I totally agree. But I also expect the users to be quite suprised when they see that BDN reports allocated memory for a code that clearly does not allocate anything in explicit way.

@kouvel
Copy link
Member

@kouvel kouvel commented Oct 2, 2020

An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1

I am afraid that this could lead to BDN reporting "too perfect" results that could differ from what end-users with default settings are experiencing

Possibly. With aggressive tiering, it still goes through the normal tiering stages, just more quickly. It would change rejit timings and more methods may be rejitted, though likely the code quality would be similar to default mode (no guarantee though since it can depend on timing and code paths hit in the calls before rejit). The other option is to give the test more warmup time to stabilize.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants
You can’t perform that action at this time.