Tuesday, March 1, 2011

cudaram - a block device exposing NVIDIA GPUs' RAM implemented with CUDA

I have been looking for a silly learning project for the Linux kernel for some time and I think I have finally come up with something suitable.
Why not use the spare RAM on your GPU for something useful without hindering your normal GPU use (vdpau, desktop effects, etc.)? I started looking and didn't really find anything of the sort. The closest was an entry in the Gentoo Wiki - Use memory on video card as swap - but that approach forces you to use the vesa driver and, at least on my GTX 260, you can't map the whole GPU RAM that way.

I came to the conclusion that the most generic way would be to expose the extra resources via a block device.

Next up was actually figuring out how to do that. I immediately thought of CUDA, as I had had some contact with it before and knew you can easily manage GPU RAM with it. It's also possible to use it without disrupting the normal chores of your GPU - like actually displaying something on your monitor.

Sounds perfect, right? The only problem is that both the CUDA toolkit and the NVIDIA drivers are closed-source and their interactions aren't documented anywhere. The only API they provide is in userspace, and hence accessing it from the kernel isn't easily doable. One could try to reverse-engineer the internal API, but I didn't want to go there with my first project, especially as both the toolkit and the drivers are constantly evolving and surely changing the API along the way.
I ended up deciding to be nice and use the CUDA userspace API. It complicates the design, but that actually might be a plus given that it's supposed to be a learning project. The final design follows:

cudaram kernel module <-> cudaramd userspace daemon <-> CUDA toolkit <-> nvidia kernel module

Basically, it is a block device with its storage implemented in userspace. There are similar things out there - like NBD and ABUSE. There is also FUSE, but that works at a different level (filesystems rather than block devices).
I decided to write my own module for two reasons: firstly, I wanted to learn as much as possible, and secondly, it gives me the most flexibility should I need it later.

And so I did. I have pushed the code to https://github.com/peper/cudaram. There is a basic README included in the repo too if you are brave enough to try it out :) I wouldn't necessarily recommend that though, as at this point I would mostly like to gather feedback on my implementation.

Nevertheless, it does seem to work and it's pretty fast, at least for some loads:
$ mkfs.ext2 /dev/cudaram0
...
$ mount /dev/cudaram0 /mnt/cuda

# /mnt/tmpfs/foo is a 250MB file in tmpfs

# copy from tmpfs to cudaram
$ dd if=/mnt/tmpfs/foo of=/mnt/cuda/foo bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.296378 s, 844 MB/s

# copy from cudaram to tmpfs
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/mnt/cuda/foo of=/mnt/tmpfs/foo bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.275168 s, 909 MB/s

# copy from tmpfs to tmpfs
$ dd if=/mnt/tmpfs/foo of=/mnt/tmpfs/foo2 bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.13663 s, 1.8 GB/s
So cudaram is about 2 times slower than tmpfs at copying one big file. Doesn't seem too bad at all for the first version. What helps it here is that with this load it's getting pretty big I/O requests. Where it might hurt is with a lot of small requests - that should be obvious after reading the following overview of the implementation.
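
As a quick check of the "about 2 times" figure, the arithmetic behind the dd numbers can be sketched in a few lines of C. The helper names here are made up for illustration; the only inputs are the byte count and the timings printed above.

```c
/* Throughput in MB/s, with 1 MB = 1000*1000 bytes, the unit dd uses above. */
static double mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}

/* How many times slower a run is than a reference run of the same size. */
static double slowdown(double seconds, double ref_seconds)
{
    return seconds / ref_seconds;
}
```

Since all three runs move the same 250000000 bytes, the slowdown is just the ratio of the times: slowdown(0.296378, 0.13663) for the write to cudaram and slowdown(0.275168, 0.13663) for the read from it, which come out at roughly 2.2 and 2.0.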

Currently the cudaram module creates a few cudaramX block devices with matching cudaramctlX control devices. The cudaramd daemon allocates the GPU RAM and a transfer buffer and starts communicating with the cudaram module via ioctl()s on the control device.

After initialization the flow is as follows:
  • ioctl() call start
  • Submit the last completed I/O request
  • If the I/O request direction was READ then the module copies the data from the transfer buffer
  • The module marks the request as completed
  • Sleep waiting for more requests
  • If there are pending I/O requests, take the first one from the queue
  • If the I/O request direction is WRITE then copy the data to the transfer buffer
  • Return the data required to complete the request (sector number etc.)
  • ioctl() call end
  • Perform the request, i.e. copy data between the GPU RAM and the transfer buffer
  • Start over
The I/O requests are queued asynchronously by the block device subsystem.
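
To make the flow above a bit more concrete, here is a minimal userspace sketch of it in C. It is not the actual cudaram code: the struct and function names are hypothetical, fake_ioctl() stands in for the module side of the control-device ioctl(), and a plain host buffer stands in for the GPU RAM (a real daemon would presumably use something like cudaMemcpy() for that copy).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define XFER_SIZE   (8u * SECTOR_SIZE)   /* transfer buffer size, arbitrary */

enum io_dir { IO_READ, IO_WRITE };

/* Hypothetical I/O request; `data` stands in for the pages of a kernel request. */
struct io_request {
    enum io_dir dir;
    size_t sector;
    size_t nsectors;
    unsigned char *data;
    int completed;
};

/* Stand-in for the module state behind one control device. */
struct fake_module {
    struct io_request *queue;        /* pending requests, in order */
    size_t pending;                  /* how many are left */
    unsigned char xfer[XFER_SIZE];   /* the shared transfer buffer */
};

/* Module side of one ioctl(): complete `done` (NULL on the first call), then
 * hand out the next request. Returns NULL when the queue is empty, which is
 * where the real module would sleep instead. */
static struct io_request *fake_ioctl(struct fake_module *m, struct io_request *done)
{
    if (done) {
        if (done->dir == IO_READ)    /* READ: data comes out of the transfer buffer */
            memcpy(done->data, m->xfer, done->nsectors * SECTOR_SIZE);
        done->completed = 1;
    }
    if (m->pending == 0)
        return NULL;
    struct io_request *next = m->queue++;
    m->pending--;
    if (next->dir == IO_WRITE)       /* WRITE: data goes into the transfer buffer */
        memcpy(m->xfer, next->data, next->nsectors * SECTOR_SIZE);
    return next;
}

/* Daemon side: perform each request between the storage (GPU RAM stand-in)
 * and the transfer buffer, then start over. */
static void daemon_loop(struct fake_module *m, unsigned char *storage, size_t storage_size)
{
    struct io_request *req = NULL;
    while ((req = fake_ioctl(m, req)) != NULL) {
        size_t off = req->sector * SECTOR_SIZE;
        size_t len = req->nsectors * SECTOR_SIZE;
        assert(len <= XFER_SIZE && off <= storage_size && len <= storage_size - off);
        if (req->dir == IO_WRITE)
            memcpy(storage + off, m->xfer, len);
        else
            memcpy(m->xfer, storage + off, len);
    }
}
```

Note that every request costs one round trip through the ioctl plus two copies (request data to transfer buffer, transfer buffer to storage, or the reverse for reads), which is why many small requests are likely to hurt more than a few big ones.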

For any more details I will have to redirect you to the source code.

TODO:
  • Figure out whether making swap to cudaram work is possible - currently it can deadlock! Might be especially tricky given that the nvidia driver is closed-source
  • Allocating GPU RAM in smaller chunks - to avoid fragmentation problems
  • Allocating GPU RAM on demand
  • Test different userspace-kernel communication schemes - e.g. vmapping the userspace buffer, adding separate read/write buffers, etc.
  • Make it more user-friendly
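
For the two allocation items, one possible shape (purely a sketch, not something that exists in the repo) is to split the device into fixed-size chunks and allocate each chunk on first write. CHUNK_SIZE, the struct and the function names are all assumptions, and calloc() stands in for a per-chunk GPU allocation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (1024u * 1024u)   /* hypothetical 1 MiB chunks */

struct chunked_store {
    unsigned char **chunks;   /* NULL until the chunk is first written */
    size_t nchunks;
};

static int store_init(struct chunked_store *s, size_t nchunks)
{
    s->chunks = calloc(nchunks, sizeof *s->chunks);
    s->nchunks = s->chunks ? nchunks : 0;
    return s->chunks ? 0 : -1;
}

/* Write len bytes at byte offset off, allocating chunks on demand. */
static int store_write(struct chunked_store *s, size_t off,
                       const unsigned char *src, size_t len)
{
    while (len > 0) {
        size_t idx = off / CHUNK_SIZE, in_chunk = off % CHUNK_SIZE;
        size_t n = CHUNK_SIZE - in_chunk;
        if (n > len)
            n = len;
        if (idx >= s->nchunks)
            return -1;
        if (!s->chunks[idx]) {
            /* a real daemon would allocate GPU RAM for the chunk here */
            s->chunks[idx] = calloc(1, CHUNK_SIZE);
            if (!s->chunks[idx])
                return -1;
        }
        memcpy(s->chunks[idx] + in_chunk, src, n);
        off += n; src += n; len -= n;
    }
    return 0;
}

/* Read len bytes at byte offset off; never-written chunks read as zeros. */
static int store_read(const struct chunked_store *s, size_t off,
                      unsigned char *dst, size_t len)
{
    while (len > 0) {
        size_t idx = off / CHUNK_SIZE, in_chunk = off % CHUNK_SIZE;
        size_t n = CHUNK_SIZE - in_chunk;
        if (n > len)
            n = len;
        if (idx >= s->nchunks)
            return -1;
        if (s->chunks[idx])
            memcpy(dst, s->chunks[idx] + in_chunk, n);
        else
            memset(dst, 0, n);
        off += n; dst += n; len -= n;
    }
    return 0;
}

/* Number of chunks that currently have backing memory. */
static size_t store_allocated(const struct chunked_store *s)
{
    size_t count = 0;
    for (size_t i = 0; i < s->nchunks; i++)
        count += s->chunks[i] != NULL;
    return count;
}

static void store_free(struct chunked_store *s)
{
    for (size_t i = 0; i < s->nchunks; i++)
        free(s->chunks[i]);
    free(s->chunks);
    s->chunks = NULL;
    s->nchunks = 0;
}
```

A request that straddles a chunk boundary is simply split into one copy per chunk, and reads of untouched areas never trigger an allocation.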
That's it for now. I would really appreciate all kinds of feedback, but especially a code review from kernel hackers :)

9 comments:

  1. Damn cool! Nice hack. :)

  2. Yay! At last a use for the expensive but unusable "nvidia Optimus" GPU in my laptop. It can't be connected to the screen, but at least its RAM can back /tmp.

    ...but wait, nvidia doesn't even support CUDA on these GPUs. Bugger.

  3. If your Nvidia Optimus is not being used at all, you can access its RAM in the old way, as the Gentoo Wiki link explains. You can use the entire graphics RAM, if nothing needs to be reserved for X.

    Also, in my experience, VESA is not the only driver that can limit the graphics memory. I remember using at least nv for basic 2D acceleration.

  4. Fun idea. Let me know when it's stable, I've got a couple of Teslas around here that could do this when they're not running calculations.

  5. Will probably do a second post when it's more stable.
    Btw. I think I know how to make swapping work as well. Just need to find some time to actually implement it.

  6. I don't know much about either OpenCL or CUDA, but something tells me that OpenCL wouldn't be much worse than CUDA at memory management, so why not do the whole thing in OpenCL (provided you know it)? That would benefit everyone and not just proprietary nvidia users.

  7. Great stuff you are doing! If you can make it faster and maybe also make it use all the available cuda devices then it might really be useful!

  8. Hi there, I'm able to load the module but when I try to start the daemon I get this error message:

    [ERR] Failed to created the cuda context

    How can I fix this?
