Tuesday, March 1, 2011

cudaram - a block device exposing NVIDIA GPUs' RAM implemented with CUDA

I have been looking for a silly learning project for the Linux kernel for some time, and I think I have finally come up with something suitable.
Why not put the extra free RAM on your GPU to some use without hindering your normal GPU use (VDPAU, desktop effects, etc.)? I started looking around and didn't really find anything of the sort. The closest was an entry in the Gentoo Wiki - Use memory on video card as swap - but that approach forces you to use the vesa driver, and at least on my GTX 260 you can't map the whole GPU RAM that way anyway.

I came to the conclusion that the most generic approach would be to expose the spare memory as a block device.

Next up was figuring out how to actually do that. I immediately thought of CUDA, as I had had some contact with it before and knew it makes managing GPU RAM easy. It can also be used without disrupting the GPU's normal chores - like actually displaying something on your monitor.
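
For a taste of what that involves, here is a minimal sketch (hypothetical, not taken from the cudaramd code) of reserving and accessing GPU RAM from an ordinary userspace process with the CUDA runtime API:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 250 * 1000 * 1000; /* 250 MB of GPU RAM */
    void *gpu_buf;
    char *host_buf = malloc(size);

    if (!host_buf)
        return 1;

    /* Reserve a chunk of GPU RAM; it stays allocated for the lifetime
     * of the process and doesn't disturb whatever the GPU is displaying. */
    if (cudaMalloc(&gpu_buf, size) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* Move data back and forth over PCIe. */
    cudaMemcpy(gpu_buf, host_buf, size, cudaMemcpyHostToDevice);
    cudaMemcpy(host_buf, gpu_buf, size, cudaMemcpyDeviceToHost);

    cudaFree(gpu_buf);
    free(host_buf);
    return 0;
}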

Sounds perfect, right? The only problem is that both the CUDA toolkit and the NVIDIA drivers are closed-source, and their interactions aren't documented anywhere. The only API they provide lives in userspace, so accessing it from the kernel isn't easily doable. One could try to reverse-engineer the internal API, but I didn't want to go there with my first project, especially as both the toolkit and the drivers are constantly evolving and surely changing that API along the way.
I ended up deciding to be nice and use the CUDA userspace API. It complicates the design, but that might actually be a plus given that it's supposed to be a learning project. The final design follows:

cudaram kernel module <-> cudaramd userspace daemon <-> CUDA toolkit <-> nvidia kernel module

Basically, it is a block device with its storage implemented in userspace. There are similar things out there - like NBD and ABUSE. There is also FUSE, but that operates at a different level.
I decided to write my own module for two reasons: firstly, I wanted to learn as much as possible, and secondly, it gives me the most flexibility should I need it later.

And so I did. I have pushed the code to https://github.com/peper/cudaram. There is a basic README in the repo too, if you are brave enough to try it out :) I wouldn't necessarily recommend that, though - at this point I would mostly like to gather feedback on the implementation.

Nevertheless, it does seem to work, and it's pretty fast, at least for some workloads:
$ mkfs.ext2 /dev/cudaram0
...
$ mount /dev/cudaram0 /mnt/cuda

# /mnt/tmpfs/foo is a 250MB file in tmpfs

# copy from tmpfs to cudaram
$ dd if=/mnt/tmpfs/foo of=/mnt/cuda/foo bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.296378 s, 844 MB/s

# copy from cudaram to tmpfs
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/mnt/cuda/foo of=/mnt/tmpfs/foo bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.275168 s, 909 MB/s

# copy from tmpfs to tmpfs
$ dd if=/mnt/tmpfs/foo of=/mnt/tmpfs/foo2 bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.13663 s, 1.8 GB/s
So cudaram is about two times slower than tmpfs at copying one big file. That doesn't seem too bad at all for a first version. What helps here is that this workload generates pretty big I/O requests. Lots of small requests might hurt, since each one takes a round trip through the userspace daemon - that should be obvious from the following overview of the implementation.

Currently the cudaram module creates a few cudaramX block devices with matching cudaramctlX control devices. The cudaramd daemon allocates the GPU RAM and a transfer buffer and starts communicating with the cudaram module via ioctl()s on the control device.
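
To give a rough idea of that control interface, here is a sketch of the kind of descriptor such an ioctl could exchange - the names and layout are hypothetical, not the actual cudaram ABI:

/* Hypothetical sketch of a control-ioctl descriptor; see the repo
 * for the real interface. Shared between module and daemon. */
#include <linux/ioctl.h>
#include <linux/types.h>

enum cudaram_dir {
    CUDARAM_NONE,   /* no pending request - daemon should just sleep */
    CUDARAM_READ,   /* daemon must copy GPU RAM -> transfer buffer */
    CUDARAM_WRITE,  /* module has placed data in the transfer buffer */
};

struct cudaram_req {
    __u64 id;       /* identifies the request being completed/handed out */
    __u64 sector;   /* starting sector of the transfer */
    __u32 nbytes;   /* length of the transfer in bytes */
    __u32 dir;      /* one of enum cudaram_dir */
};

#define CUDARAM_IOC_MAGIC 'C'
/* Complete the previous request and fetch the next one in a single
 * round trip - see the flow below. */
#define CUDARAM_IOC_XCHG _IOWR(CUDARAM_IOC_MAGIC, 0, struct cudaram_req)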

After initialization the flow is as follows (a sketch of the daemon side follows the list):
  • ioctl() call start
  • Submit the last completed I/O request
  • If the I/O request direction was READ then the module copies the data from the transfer buffer
  • The module marks the request as completed
  • Sleep waiting for more requests
  • If there are pending I/O requests, take the first one from the queue
  • If the I/O request direction is WRITE then copy the data to the transfer buffer
  • Return the data required to complete the request (sector number etc.)
  • ioctl() call end
  • Perform the request, i.e. copy data between the GPU RAM and the transfer buffer
  • Start over
The I/O requests are queued asynchronously by the block device subsystem.
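
Put together, the daemon's main loop built on a single exchange ioctl would look roughly like this - again a hypothetical sketch reusing the cudaram_req names from above, not the real code:

#include <string.h>
#include <sys/ioctl.h>
#include <cuda_runtime.h>

#define SECTOR_SIZE 512

static void daemon_loop(int ctl_fd, void *gpu_base, void *xfer_buf)
{
    struct cudaram_req req;

    memset(&req, 0, sizeof(req));
    req.dir = CUDARAM_NONE;     /* nothing completed on the first pass */

    for (;;) {
        /* One round trip: report the request we have just completed and
         * sleep in the kernel until a new one is handed out. */
        if (ioctl(ctl_fd, CUDARAM_IOC_XCHG, &req) < 0)
            break;

        if (req.dir == CUDARAM_NONE)
            continue;

        /* Perform the request: move data between the transfer buffer
         * and the backing store in GPU RAM. */
        if (req.dir == CUDARAM_READ)
            cudaMemcpy(xfer_buf,
                       (char *)gpu_base + req.sector * SECTOR_SIZE,
                       req.nbytes, cudaMemcpyDeviceToHost);
        else /* CUDARAM_WRITE */
            cudaMemcpy((char *)gpu_base + req.sector * SECTOR_SIZE,
                       xfer_buf,
                       req.nbytes, cudaMemcpyHostToDevice);
    }
}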

For any further details I will have to redirect you to the source code.

TODO:
  • Figure out whether swapping to cudaram can be made to work - currently it can deadlock! This might be especially tricky given that the nvidia driver is closed-source
  • Allocating GPU RAM in smaller chunks - to avoid fragmentation problems
  • Allocating GPU RAM on demand
  • Test different userspace-kernel communication schemes - e.g. vmapping the userspace buffer, adding separate read/write buffers, etc.
  • Make it more user-friendly
That's it for now. I would really appreciate all kinds of feedback, but especially a code review from kernel hackers :)