[RUMOR] Pascal in trouble with Asynchronous Compute

Flexible, but too slow to be of practical use for a lot of the features you would want, which is likely the same reason your Intel engineer said it wouldn't work. CPU scheduling would only really work if you could predict when every warp would finish, and you'd still pay the cost of making that prediction. Hardware scheduling, with the ACEs providing more queues, would let you pick which workload you wanted to schedule rather quickly, likely based on some really simple metrics.
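
A rough way to picture that difference (a purely illustrative Python sketch, with made-up names and numbers, not how any driver or GPU actually does it): the hardware-style selection only needs a cheap comparison over per-queue counters, whereas a CPU-side scheduler would additionally have to estimate when resident warps finish.

```python
# Purely illustrative sketch: picking the next queue to dispatch from
# using a simple, cheap metric (waves already in flight per queue).
# Names and numbers are made up for illustration.

from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    pending: int      # work items waiting in this queue
    in_flight: int    # waves from this queue currently resident on the GPU

def pick_next(queues):
    """Hardware-style choice: no prediction of warp completion times,
    just a comparison over counters each queue already maintains."""
    candidates = [q for q in queues if q.pending > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda q: q.in_flight)

queues = [Queue("graphics", pending=12, in_flight=48),
          Queue("compute",  pending=30, in_flight=8)]
print(pick_next(queues).name)   # -> "compute"
```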
 
He wasn't talking about CPU scheduling, he was talking about doing it in hardware. He didn't even get into the first part of my statement; that part is only about the HWS currently in Fiji, GCN 1.3.
 
That's going to depend a lot on the context he was talking about. Intel, last I checked, didn't have multiple queues to provide a selection of warps, so there are a lot of ifs in regards to what he may have said. AMD has already demonstrated selecting a compute wave over graphics with their QRQ (quick response queue). It could be as simple as knowing how many graphics and compute waves are currently resident on a compute unit. It should be a simple choice, not a prediction, if that's what you asked him.
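
For example (hypothetical sketch, not AMD's actual logic), the per-CU decision could literally come down to comparing counters the hardware already keeps:

```python
# Hypothetical sketch of a per-CU choice: issue the next compute wave to
# whichever compute unit currently has the fewest resident waves, as long
# as it still has a free wave slot. Counters and limits are made up.

MAX_WAVES_PER_CU = 40  # illustrative limit, not a real hardware number

def choose_cu(cu_states):
    """cu_states: list of dicts with 'graphics' and 'compute' resident-wave counts."""
    best, best_load = None, None
    for idx, cu in enumerate(cu_states):
        load = cu["graphics"] + cu["compute"]
        if load >= MAX_WAVES_PER_CU:
            continue  # CU is full, skip it
        if best_load is None or load < best_load:
            best, best_load = idx, load
    return best  # None means every CU is full, so the wave waits

cus = [{"graphics": 36, "compute": 2},
       {"graphics": 20, "compute": 4},
       {"graphics": 40, "compute": 0}]
print(choose_cu(cus))  # -> 1
```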
 
This is what I asked him, by quoting your statement:

That's one area where the queue priorities would probably help. There is likely some control, albeit in drivers, over work distribution. Limit graphics to 70% occupancy, reserving 30% for compute, for example. This will probably get exposed in Vulkan; DX12 is another matter (one compute queue). Those features should make async somewhat self-tuning based on hardware. I know the ACEs are programmable (drivers). It would make sense that the work distributor could be configured as well. Score shaders by tex/memory:math ratio and attempt to balance all the compute units.
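
A rough sketch of what that kind of driver-side split could look like (illustrative Python; the slot count and the 70/30 split are placeholders taken from the example above, not real driver settings):

```python
# Illustrative sketch of a driver-style occupancy budget: cap the share of
# wave slots that graphics may occupy, leaving the rest reserved for compute.
# All numbers and names are placeholders, not real driver settings.

WAVE_SLOTS_PER_CU = 40
GRAPHICS_SHARE = 0.70   # e.g. limit graphics to 70% occupancy

def can_launch(kind, resident_graphics, resident_compute):
    total = resident_graphics + resident_compute
    if total >= WAVE_SLOTS_PER_CU:
        return False
    if kind == "graphics":
        # graphics may only take up to its configured share of the slots
        return resident_graphics < int(WAVE_SLOTS_PER_CU * GRAPHICS_SHARE)
    return True  # compute can use the reserved remainder (and any slack)

print(can_launch("graphics", resident_graphics=28, resident_compute=0))  # False, cap hit
print(can_launch("compute",  resident_graphics=28, resident_compute=0))  # True
```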

He specifically stated that "space slicing" a workload is very hard to do with different workloads going through a pipeline, across different versions of the chip, small to big within the same generation, let alone across different generations.
 
My guess is he interpreted that as time slicing the load across the entire GPU, not a CU. You don't really time slice those; anything scheduled runs till completion. Worst case is the status quo, best case you get a ridiculously ALU-bound compute load running alongside graphics.
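
A made-up back-of-the-envelope illustration of that best case: if the graphics waves only keep the ALUs busy part of the time, an ALU-heavy compute load can soak up the idle cycles instead of lengthening the frame.

```python
# Made-up numbers, purely to illustrate why ALU-bound compute overlaps well
# with graphics work that stalls on texture/memory much of the time.

alu_busy_graphics = 0.55   # fraction of cycles the graphics work keeps ALUs busy
alu_needed_compute = 0.35  # extra ALU occupancy the compute pass would like

combined = alu_busy_graphics + alu_needed_compute
if combined <= 1.0:
    print("compute fits in the idle ALU cycles; roughly 'free' concurrency")
else:
    print(f"ALUs oversubscribed by {combined - 1.0:.0%}; the passes start to fight")
```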
 
Sorry, misspoke, he stated space slicing; damn hangover from last night.

He also stated that current console code will not run well on future hardware: there will most likely be no performance gains from async on future hardware with older (current) titles, and hopefully there won't be a regression in performance. This is specific to AMD hardware.
 
Sorry, misspoke, he stated space slicing; damn hangover from last night.

He also stated that current console code will not run well on future hardware: there will most likely be no performance gains from async on future hardware with older (current) titles, and hopefully there won't be a regression in performance. This is specific to AMD hardware.
That sounds like downplaying.
 
No it's not; even at GDC, AMD pretty much stated the same thing: hardware from different generations needs different code to perform optimally.

I've been saying this for what, 8 months now, and you think this is downplaying. It's just the way things are.
 
Sorry, misspoke, he stated space slicing; damn hangover from last night.

He also stated that current console code will not run well on future hardware: there will most likely be no performance gains from async on future hardware with older (current) titles, and hopefully there won't be a regression in performance. This is specific to AMD hardware.
Space slicing, if I understand correctly, should be partitioning the compute units, which is still different from what I had in mind. That'd be zero concurrency.

The console part makes sense. They should run well, but there just wouldn't be enough compute to keep all the ALUs busy.
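
To make the distinction concrete (illustrative only, not a claim about real hardware behavior): space slicing statically splits the CUs between the two workloads, while the per-CU sharing discussed earlier mixes graphics and compute waves on every CU.

```python
# Illustrative contrast: "space slicing" gives each workload its own subset
# of compute units, so no single CU ever runs both, while per-CU sharing
# interleaves waves from both workloads on every CU.

NUM_CUS = 8

def space_sliced(graphics_cus):
    """Static partition: first N CUs for graphics, the rest for compute."""
    return ["graphics" if i < graphics_cus else "compute" for i in range(NUM_CUS)]

def shared():
    """Per-CU sharing: every CU holds a mix of graphics and compute waves."""
    return ["graphics+compute"] * NUM_CUS

print(space_sliced(6))  # zero concurrency within any single CU
print(shared())         # concurrency inside each CU
```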

No it's not; even at GDC, AMD pretty much stated the same thing: hardware from different generations needs different code to perform optimally.

I've been saying this for what, 8 months now, and you think this is downplaying. It's just the way things are.
I'm not sure that was in regards to async specifically. My take was that it was the color compression on Fiji/Tonga needing a unique path.
 
Well, I can tell ya this: he isn't the first person to say it, nor will he be the last, and it's not only Tonga and Fiji that saw wonky results with async-enabled games.

Yes, ALU utilization will have an effect on older titles on newer hardware; didn't think of that before, good point.
 
First response:

GCN 1.0 certainly isn't; the ACEs weren't programmable back then at all.
GCN 1.1 is programmable, but the space is limited, and that space is required for the queue decoding logic.
GCN 1.2 might be able to do such a thing, but I'm not entirely sure what the HWS / "new" ACE units on Tonga and Fiji are actually doing right now.

We have no idea what the HWS units do; what Dave Baumann (who works at AMD) told me was that each HWS unit works like 2 ACE units. So I'm going with the queue decoding logic not being there, and they need a lot of space for that.

I did PM a couple of Intel engineers as well and am waiting on their responses, but I'm pretty sure I'll get a similar answer.
Apologies for the slight necro, but the feature I described probably looks a lot like this: Hardware Managed Ordered Circuit, patented by AMD in 2013.

An embodiment of the present invention provides an apparatus including a scoreboard structure configured to store information associated with a plurality of wavefronts. The apparatus further includes a controller, comprising a plurality of counters, configured to control an order of operations, such that a next one of the plurality of wavefronts to be processed is determined based on the stored information and an ordering scheme.
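
Loosely in code, that abstract maps to something like this toy Python sketch (the fields and the ordering scheme are invented; this is not the patented circuit): a scoreboard holds per-wavefront state, and the controller's counters plus an ordering scheme pick the next wavefront.

```python
# Toy sketch of the idea in the abstract: a scoreboard stores per-wavefront
# state, and a controller uses counters plus an ordering scheme to decide
# which wavefront is processed next. Fields and the scheme are invented.

from dataclasses import dataclass

@dataclass
class WavefrontEntry:
    wave_id: int
    priority: int      # e.g. queue priority this wavefront arrived with
    age: int           # counter: scheduling rounds it has waited
    ready: bool        # dependencies/operands resolved

def next_wavefront(scoreboard):
    """Ordering scheme (invented): ready waves only, highest priority first,
    with the oldest wave breaking ties so nothing starves."""
    ready = [w for w in scoreboard if w.ready]
    if not ready:
        return None
    return max(ready, key=lambda w: (w.priority, w.age))

def tick(scoreboard):
    # the controller bumps the age counters each round
    for w in scoreboard:
        w.age += 1

board = [WavefrontEntry(0, priority=1, age=5, ready=True),
         WavefrontEntry(1, priority=3, age=0, ready=True),
         WavefrontEntry(2, priority=3, age=2, ready=False)]
tick(board)
print(next_wavefront(board).wave_id)  # -> 1
```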

Apparently they've taken the tuning feature a step further as well: HETEROGENEOUS FUNCTION UNIT DISPATCH IN A GRAPHICS PROCESSING UNIT. That is a discussion for another thread, though. Point being, async isn't something that should need a lot of tuning on relatively new architectures.
 
The methodology is simple and straightforward, which is what that patent describes; the actual implementation is not. This is the same thing that everyone told me, and which I told you.
 