As someone who is pretty skeptical and reads the fine print, I think this is a g...

vlovich123 · on July 17, 2024

AFAIK I believe all they did was move the closed source user space driver code to their opaque firmware blob leaving a thin shim in the kernel.

In essence I don’t believe that much has really changed here.

adrian_b · on July 18, 2024

Having as open-source all the kernel, more precisely all the privileged code, is much more important for security than having as open-source all the firmware of the peripheral devices.

Any closed-source privileged code cannot be audited and it may contain either intentional backdoors, or, more likely, bugs that can cause various undesirable effects, like crashes or privilege escalation.

On the other hand, in a properly designed modern computer any bad firmware of a peripheral device cannot have a worse effect than making that peripheral unusable.

The kernel should take care, e.g. by using the I/O MMU, that the peripheral cannot access anything where it could do damage, like the DRAM not assigned to it or the non-volatile memory (e.g. SSDs) or the network interfaces for communicating with external parties.

Even when the peripheral is so important as the display, a crash in its firmware would have no effect if the kernel had reserved some key combination to reset the GPU (while I am not aware of such a useful feature in Linux, its effect can frequently be achieved by switching, e.g. with Alt+F1, to a virtual console and then back to the GUI, the saving and restoring of the GPU state together with the switching of the video modes being enough to clear some corruption caused by a buggy GPU driver or a buggy mouse or keyboard driver).

In conclusion, making the NVIDIA kernel driver as open source does not deserve to have its importance minimized. It is an important contribution to a more secure OS kernel.

The only closed-source firmware that must be feared is that which comes from the CPU manufacturer, e.g. from Intel, AMD, Apple or Qualcomm.

All such firmware currently includes various features for remote management that are not publicly documented, so you can never be sure if they can be properly disabled, especially when the remote management can be done wirelessly, like through the WiFi interface of the Intel laptop CPUs, so you cannot interpose an external firewall to filter the network traffic of any "magic" packets.

A paranoid laptop user can circumvent the lack of control over the firmware blobs from the CPU manufacturer by disconnecting the internal antennas and using an external cheap and small single-board computer for all wired and wireless network access, which must run a firewall with tight rules. Such a SBC should be chosen among those for which complete hardware documentation is provided, i.e. including its schematics.

stragies · on July 18, 2024

Everything you wrote assumes the IOMMUs across the board to be 100% correctly implemented without errors/bugdoors.

People used to believe similar things about Hyperthreading, glitchability, ME, Cisco, boot-loaders, ... the list goes on.

adrian_b · on July 18, 2024

There still is a huge difference between running privileged code on the CPU, for which there is nothing limiting what it can do, and code that runs on a device, which should normally be contained by the I/O MMU, except if the I/O MMU is buggy.

The functions of an I/O MMU for checking and filtering the transfers are very simple, so the probability of non-intentional bugs is extremely small in comparison with the other things enumerated by you.

stragies · on July 18, 2024

Agreed, that the feature-set of IOMMU is fairly small, but is this function not usually included in one of the Chipset ICs, which do run a lot other code/functions alongside a (hopefully) faithful correct IOMMU routine?

Which -to my eyes- would increase the possibility of other system parts mucking with IOMMU restrictions, and/or triggering bugs.

saagarjha · on July 18, 2024

Did you run this through a LLM? I'm not sure what the point is of arguing with yourself and bringing up points that seem tangential to what you started off talking about (…security of GPUs?)

adrian_b · on July 18, 2024

I have not argued with myself. I do not see what made you believe this.

I have argued with "I don’t believe that much has really changed here", which is the text to which I have replied.

As I have explained, an open-source kernel module, even together with closed-source device firmware, is much more secure than a closed-source kernel module.

Therefore the truth is that a lot has changed here, contrary to the statement to which I have replied, as this change makes the OS kernel much more secure.

stkdump · on July 17, 2024

But the firmware runs directly on the hardware, right? So they effectively rearchitected their system to move what used to be 'above' the kernel to 'below' the kernel, which seems like a huge effort.

vlovich123 · on July 17, 2024

It’s some effort but I bet they added a classical serial CPU to run the existing code. In fact, [1] suggests that’s exactly what they did. I suspect they had other reasons to add the GSP so the amortized cost of moving the driver code to firmware was actually not that large all things considered and in the long term reduces their costs (eg they reduce the burden further of supporting multiple OSes, they can improve performance further theoretically, etc etc)

[1] https://download.nvidia.com/XFree86/Linux-x86_64/525.78.01/R...

p_l · on July 17, 2024

That's exactly what happened - Turing microarchitecture brought in new[1] "GSP" which is capable enough to run the task. Similar architecture happens AFAIK on Apple M-series where the GPU runs its own instance of RTOS talking with "application OS" over RPC.

[1] Turing GSP is not the first "classical serial CPU" in nvidia chips, it's just first that has enough juice to do the task. Unfortunately without recalling the name of the component it seems impossible to find it again thanks to search results being full of nvidia ARM and GSP pages...

mepian · on July 17, 2024

>the name of the component

Falcon?

p_l · on July 17, 2024

THANK YOU, that was the name I was forgetting :)

here's[1] a presentation from nvidia regarding (unsure if done or not) plan for replacing Falcon with RISC-V, [2] suggests the GSP is in fact the "NV-RISC" mentioned in [1]. Some work on reversing Falcon was apparently done for Switch hacking[3]?

[1] https://riscv.org/wp-content/uploads/2016/07/Tue1100_Nvidia_... [2] https://www.techpowerup.com/291088/nvidia-unlocks-gpu-system... [3] https://github.com/vbe0201/faucon

knotimpressed · on July 17, 2024

Would you happen to have a source or any further readings about Apple M-series GPUs running their own RTOS instance?

p_l · on July 17, 2024

Asahi Linux documentation has pretty good writeup.

The GPU is described here[1] and the mailbox interface used generally between various components is described here [2]

[1] https://github.com/AsahiLinux/docs/wiki/HW%3AAGX#overview

[2] https://github.com/AsahiLinux/docs/wiki/HW%3AASC

imtringued · on July 17, 2024

Why? It should make it much easier to support Nvidia GPUs on Windows, Linux, Arm/x86/RISC-V and more OSes with a single firmware codebase per GPU now.

stkdump · on July 17, 2024

Yes makes sense, in the long run it should make their life easier. I just suspect that the move itself was a big effort. But probably they can afford that nowadays.