question

Eden Soto asked ·

GPU Cache Population Crashing with 4 GPUs... fine with just 1

4x 1080 Ti's... C4D R21 up-to-date, latest NVIDIA Studio Driver, latest C4DtoA plugin, here's the log...

gpu_cache_population_log.txt

I've deleted the cache in AppData\Local\NVIDIA\OptixCache\arnold-6.0.1.0_driver-441.66, but each time I try to run it with all four GPUs in the system, it fails (see screenshot below for final crash entry)... I delete the cache before each new attempt.

If I run it with just one GPU in the machine, it completes successfully... with all four in, it's a guaranteed crash.

For what it's worth, I can render with all 4 GPUs no problem with Redshift, so the GPUs themselves are fine.

Would love to know if anyone knows why this could be happening and/or how to get around it

c4dtoagpu
2 comments

If you skip the cache prepopulation and just render a simple scene using 1, 2, and 4 GPUs, does it work? The cache prepopulation is only there as a convenience. Skipping it is fine if you don't mind that the first few renders might take a few minutes before pixels start to appear.

Knowing the answer to this will also help us pinpoint where the problem might be coming from.

0 Likes 0 · ·

The only way I can render with the GPU is if I have just one in the system... but check my reply to your other note below... the system is plenty capable, so it seems as though it’s something specific to Arnold rendering on the GPUs... CPU is perfectly fine

0 Likes 0 · ·
Thiago Ize answered ·

Thanks Eden and Dante for your detailed investigations! It looks like NVIDIA has been able to reproduce a driver bug where using 4 or more GPUs with textures will usually cause a crash. This bug could in theory also cause crashes when not using textures or when using fewer than 4 GPUs, though we haven't yet been able to confirm how common this is. They're making good progress towards fixing it, so fingers crossed, it will get fixed in one of the upcoming driver releases.

In the meantime, you can render with 3 GPUs in order to drastically reduce the likelihood of a crash and we'll also investigate whether there are any workarounds we can do in Arnold.

2 comments

Fantastic news! Looking forward to getting a working solution so I can do some thorough testing

0 Likes 0 · ·

FWIW, no combination of GPU selection makes Arnold GPU render stable, so selecting only three didn’t help for me... I’m going to physically remove one and see if that will do it... but like I’ve said in another reply, I can render with all four GPUs in Redshift till the cows come home and it never flinches, which is why I think the issue is unique to Arnold in my setup.

0 Likes 0 · ·
Peter Horvath answered ·

We had multiple reports of similar issues, seems to be an issue with multiple GPUs. We're investigating. Thanks.

5 comments

Would love to get this resolved... I still haven't really been able to gauge speed differences between the GPUs and the CPU... I can get the IPR to sometimes work without crashing, but it will eventually crash C4D completely, sometimes even locking up the system requiring a hard reset

0 Likes 0 · ·

If your machine is locking up, that could maybe mean your power supply is unable to power your four GPUs and CPU. If you disable a GPU or two, does it now work? It's possible Arnold is putting more demands on all of your system which ends up tipping it over the power threshold.

Another possibility for a system appearing to lock up is that you ran out of CPU memory. If you run the task manager you might be able to see if that is happening.

Otherwise, system lockups can be indicative of a driver bug.

0 Likes 0 · ·

Just last week I did a 3:00 Redshift render with all 4 GPUs that took 18 hours... not a single hiccup, the GPUs are fine and I’m on the latest NVIDIA Studio Driver that hasn’t changed since then... Have a 2990WX Threadripper with 128GB of RAM and a 1600W power supply... there’s plenty there to handle it

0 Likes 0 · ·

I'm running into the same problem. With multiple GPUs (I'm running 4 RTX 2080ti's), it will use all the cards UNTIL you turn textures on. With textures disabled in the DEBUG window, it renders fine. I've tried converting the textures to different formats (TGA/TIF/EXR/etc), with and without MIPs, etc. Nothing works, unless you want to run ONLY one card and see your textures in the render. Any update on when we might be able to see a fix?

2 Likes 2 · ·

I never tried all four when disabling textures... I'll try that today and see if I get the same results

1 Like 1 · ·
Stephen Blair answered ·

I'm going to restate the problem to make sure I have it right, because the screenshot shows a crash during rendering.

But really the problem happens before the render, when the pre-population fails with this error:

[ERROR]     internal error during render permutation 11/40:

So, GPU Pre-Population failed, and then when you tried to render using that GPU cache, the render failed.


Can you delete everything in AppData\Local\NVIDIA\OptixCache, not just that one folder?

Then try to pre-populate again
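
For anyone who wants to script that cleanup, here is a minimal Python sketch. It assumes the cache lives under %LOCALAPPDATA%\NVIDIA\OptixCache (the AppData\Local path quoted above). The function names are made up for illustration and are not part of Arnold or the NVIDIA driver.

```python
import os
import shutil
from pathlib import Path


def default_optix_cache_dir() -> Path:
    # Assumption: %LOCALAPPDATA% points at the AppData\Local folder
    # mentioned in this thread; fall back to the usual location otherwise.
    local_app_data = os.environ.get(
        "LOCALAPPDATA", str(Path.home() / "AppData" / "Local")
    )
    return Path(local_app_data) / "NVIDIA" / "OptixCache"


def clear_optix_cache(cache_dir: Path) -> int:
    """Delete everything inside cache_dir but keep the folder itself.

    Returns the number of top-level entries that were removed.
    """
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # a per-version cache folder
        else:
            entry.unlink()  # a stray file or symlink
        removed += 1
    return removed


# Usage (close C4D first so no cache files are in use):
#   clear_optix_cache(default_optix_cache_dir())
```

This empties the whole OptixCache folder rather than a single arnold-..._driver-... subfolder, which matches the request above.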

1 comment

I emptied the whole AppData\Local\NVIDIA\OptixCache folder and ran the command again... it crashed again. New log attached: gpu-cache-population-log.txt

0 Likes 0 · ·
Vedran Klemen answered ·

Why do you need GPU cache population? Does it render faster? Arnold 6 is very snappy without it...

4 comments

You don't need it. All it does is pre-compile some shaders into the GPU cache so that the first few times you render after upgrading your Arnold or NVIDIA driver version, you don't need to wait a few minutes for the GPU shaders to compile, since they are already in the cache. On the other hand, once you've rendered a scene, those shaders will also be in the cache, so there's no longer any need to pre-populate.

Personally, I don't usually bother to pre-populate.

0 Likes 0 · ·

Yes, I think it's already old technology... :)

0 Likes 0 · ·

For me it’s been a way to test and determine why my machine crashes with more than one GPU in the system... that’s why I continue to run it... if the machine crashes in the IPR or during output render, I don’t have a way to figure out why without the log? At least I don’t know of any other way

0 Likes 0 · ·
Thiago Ize answered ·

Arnold 6.0.1.1, released today, should fix the multi-gpu texture hangs. This won't help with the gpu cache pre-population issues.

5 comments

Thanks, will try it out now

0 Likes 0 · ·

Opened an existing project I rendered with the CPU, switched to GPU... after a long period of seemingly nothing happening, C4D quit to the desktop... emptied the NVIDIA cache folders and tried again... another crash... so it's still a no-go for me using the GPUs in C4DtoA

0 Likes 0 · ·

Thanks for trying it out. It looks like this is likely a different bug from what we fixed -- what we fixed should help Dante out.

0 Likes 0 · ·

Hello Thiago! I'll try this out ASAP - thanks!!

0 Likes 0 · ·

I don't know if this will help, but nvidia released a new driver yesterday (442.19), so it might be worth a shot.

0 Likes 0 · ·
Dante Rinaldi answered ·

Hi everyone - just to add my experience again. It seems as if the new GPU code still is not handling textures as well as the CPU mode. I removed all my 2080ti's and went down to 2 RTX2080ti's (with NVlink to double the texture memory pool, will that work?) and large scenes still crash the renderer 100% of the time. I was under the impression that the new GPU code will MIP and tile down all textures and load on demand - is that still the case?
I have all the latest drivers (including the new 442.19). All debug modes work fine (subd, displacement, etc), but more than a few textures is a guaranteed crash. The CPU handles them great.
I just saw that a new Arnold version is available to download - I'll try this ASAP!
Also to note, I am using textures on references, and reference instances. I'll post a few logs tonight. If anyone has any ideas, I'd love to hear them - thanks!

4 comments

The GPU will still use more texture memory than CPU. This is a known issue and one that we're actively working on. Last I checked, nvlink does not double the texture memory available, though one of the future nvidia driver updates should I think fix that.

Yes, GPU loads on demand, but it currently loads much more texture data than CPU. For now you'll have to use the workaround of setting a max texture resolution in the render settings.

Hopefully with this new build you should be able to render on multiple GPUs with textures without problems, provided it fits in memory.
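
To get a rough feel for why capping the maximum texture resolution helps, here is a back-of-the-envelope sketch in Python. It assumes an uncompressed RGBA texture with 8 bits per channel and a full mip chain. How Arnold actually stores textures on the GPU may differ, so treat the numbers as an illustration only. The function names are hypothetical.

```python
def mip_chain_bytes(width: int, height: int,
                    channels: int = 4, bytes_per_channel: int = 1) -> int:
    """Bytes for a texture plus all of its mip levels down to 1x1."""
    total = 0
    while True:
        total += width * height * channels * bytes_per_channel
        if width == 1 and height == 1:
            break
        width = max(1, width // 2)
        height = max(1, height // 2)
    return total


def capped_mip_chain_bytes(width: int, height: int, max_res: int,
                           channels: int = 4, bytes_per_channel: int = 1) -> int:
    """Same estimate after dropping mip levels larger than max_res."""
    while max(width, height) > max_res:
        width = max(1, width // 2)
        height = max(1, height // 2)
    return mip_chain_bytes(width, height, channels, bytes_per_channel)


full = mip_chain_bytes(4096, 4096)
capped = capped_mip_chain_bytes(4096, 4096, max_res=256)
print(f"4096x4096 with mips: {full / 2**20:.1f} MiB")
print(f"capped at 256:       {capped / 2**20:.2f} MiB")
```

Each halving of the cap cuts the estimate to roughly a quarter, which is why a low cap can make a texture-heavy scene fit in GPU memory at the cost of detail.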

0 Likes 0 · ·

Thiago-
First, thanks for being so responsive and answering these threads and being awesome. All of my RTX 2080ti's are the 11GB variety - sorry for being a bit clueless, but is there a specific line or a way from reading the log files that the user can see the texture hit on the card? I tried the same scene last night by capping the max texture size in the Systems tab to 256 - it actually got to first pixel and crashed, which was encouraging.
Does the GPU engine (minus overhead) still MIP the TXs down by judging distance from camera?

0 Likes 0 · ·

Make sure you enable sufficient verbosity in the log file (see link to the right of this page for help on log files) and then at the end of a successful render (we do need to add a way to report memory used during a failed render) you can see the memory used (in this case, 3.5GB of texture memory on the GPU):

3:30 164MB | peak GPU memory consumed 5197.00MB
3:30 164MB |  output buffers            79.33MB
3:30 164MB |  geometry                 133.37MB
3:30 164MB |   polymesh                133.37MB
3:30 164MB |  texture cache           3552.00MB
3:30 164MB |  unaccounted             1432.30MB
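
If you want to pull these numbers out of a log automatically, here is a small Python sketch. It assumes the lines look exactly like the excerpt above, with the indentation after the "|" marking nesting (one space for the peak total, two for a top-level category, three for a sub-category such as polymesh). That layout is inferred from this one excerpt, not from any documented log format, and the helper names are hypothetical.

```python
import re

# Matches e.g. "3:30 164MB |  texture cache           3552.00MB"
LINE_RE = re.compile(
    r"\|(?P<indent> +)(?P<label>.+?)\s+(?P<mb>\d+(?:\.\d+)?)MB\s*$"
)


def parse_gpu_memory(log_text: str) -> list[tuple[int, str, float]]:
    """Return (indent depth, label, megabytes) for each memory line."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            rows.append(
                (len(match["indent"]), match["label"], float(match["mb"]))
            )
    return rows


def texture_share(rows: list[tuple[int, str, float]]) -> float:
    """Fraction of peak GPU memory used by the texture cache."""
    peak = next(mb for depth, _, mb in rows if depth == 1)
    textures = next(mb for _, label, mb in rows if label == "texture cache")
    return textures / peak


LOG = """\
3:30 164MB | peak GPU memory consumed 5197.00MB
3:30 164MB |  output buffers            79.33MB
3:30 164MB |  geometry                 133.37MB
3:30 164MB |   polymesh                133.37MB
3:30 164MB |  texture cache           3552.00MB
3:30 164MB |  unaccounted             1432.30MB
"""

rows = parse_gpu_memory(LOG)
top_level_total = sum(mb for depth, _, mb in rows if depth == 2)
print(f"top-level categories add up to {top_level_total:.2f}MB")
print(f"texture cache share of peak: {texture_share(rows):.1%}")
```

For the excerpt above, the four top-level categories add up to the reported peak, and the texture cache accounts for roughly two thirds of it.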
0 Likes 0 · ·
Thiago Ize answered ·

We've now released Arnold 6.0.2.0 which fixed a hang/crash when doing the GPU cache pre-population on machines with lots of cores. I suspect this might have been the problem Eden was experiencing.

Hopefully this, plus the previous Arnold release which fixed the multi-GPU texture hang, should solve the multi-GPU issues reported in this Arnold Answers question.
