burtenshaw posted an update 3 days ago
Still speedrunning Gemma 3 to think. Today I focused on setting up GPU-poor hardware to run GRPO.

This is a plain TRL and PEFT notebook that works on Apple Silicon Macs or a Colab T4. It uses the 1B variant of Gemma 3 and a reasoning version of the GSM8K dataset.

🧑‍🍳 There’s more still in the oven like releasing models, an Unsloth version, and deeper tutorials, but hopefully this should bootstrap your projects.

Here’s a link to the 1b notebook: https://colab.research.google.com/drive/1mwCy5GQb9xJFSuwt2L_We3eKkVbx2qSt?usp=sharing
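For context on what a GRPO setup like this involves: TRL's GRPOTrainer scores sampled completions with user-supplied reward functions. Below is a minimal sketch of a format reward for reasoning-style outputs — the `<think>`/`<answer>` tag names are hypothetical and should be matched to whatever format the notebook's system prompt actually uses:

```python
import re

# Hypothetical reasoning format: <think>...</think> followed by <answer>...</answer>.
# Adjust the tags to match the dataset/system prompt in the actual notebook.
FORMAT_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that follows the expected layout, else 0.0.

    TRL reward functions receive a list of completions and return a list of
    floats, one per completion. Conversational completions arrive as a list of
    message dicts; plain-text ones as strings — handle both.
    """
    rewards = []
    for completion in completions:
        text = completion[0]["content"] if isinstance(completion, list) else completion
        rewards.append(1.0 if FORMAT_PATTERN.search(text) else 0.0)
    return rewards
```

A reward like this would then be passed to the trainer alongside a correctness reward, e.g. `GRPOTrainer(..., reward_funcs=[format_reward, ...])`.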

Hi,

Thanks for the notebook.

I tried running this notebook on my M2 Mac, but it doesn’t work due to a bitsandbytes issue. The error I’m getting is:

AssertionError: Torch not compiled with CUDA enabled

Since bitsandbytes doesn’t appear to support Apple Silicon yet, it crashes when calling torch.cuda.current_device(). MPS is available on my machine, but the notebook still fails:

import torch
print(torch.mps.is_available())  # Returns True

I also checked my installed package versions with:

uv pip freeze | grep -E "bitsandbytes|transformers|trl|peft|torch"
>>>
bitsandbytes==0.42.0
peft @ git+https://github.com/huggingface/peft.git@7320bb94a04f32dd576c8952cb4d4f59f8fc5a1b
torch==2.8.0.dev20250314
torchaudio==2.6.0.dev20250314
torchvision==0.22.0.dev20250314
transformers @ git+https://github.com/huggingface/transformers@46350f5eae87ac1d168ddfdc57a0b39b64b9a029
trl @ git+https://github.com/huggingface/trl.git@5cb390cd306b6b6aedf6403e6572c62bc33f96db

I believe the issue is that bitsandbytes isn’t compatible with MPS.

Is there a recommended workaround for running this on a Mac without CUDA?
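One possible workaround (a sketch, not an official fix) is to avoid bitsandbytes entirely when CUDA is absent: skip any 4-bit quantization when loading the model, and switch the trainer to a plain PyTorch optimizer. The helper below is hypothetical; both optimizer strings are standard `optim` values accepted by transformers' TrainingArguments (and thus by TRL's GRPOConfig):

```python
def pick_optimizer(cuda_available: bool) -> str:
    """Choose a trainer optimizer string based on hardware.

    bitsandbytes' paged 8-bit optimizers require CUDA, so on Apple Silicon
    (MPS) or CPU we fall back to plain AdamW ("adamw_torch"), which avoids
    the torch.cuda.current_device() call that crashes here.
    """
    return "paged_adamw_8bit" if cuda_available else "adamw_torch"

# Usage (sketch): feed the result into the training config, e.g.
#   GRPOConfig(..., optim=pick_optimizer(torch.cuda.is_available()))
```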

Thanks.

Truncated error message:

File ~/playgrounds/.venv/lib/python3.12/site-packages/bitsandbytes/functional.py:1395, in optimizer_update_8bit_blockwise(optimizer_name, g, p, state1, state2, beta1, beta2, eps, step, lr, qmap1, qmap2, absmax1, absmax2, weight_decay, gnorm_scale, skip_zeros)
   1374 def optimizer_update_8bit_blockwise(
   1375     optimizer_name: str,
   1376     g: Tensor,
   (...)
   1391     skip_zeros=False,
   1392 ) -> None:
   1394     optim_func = None
-> 1395     prev_device = pre_call(g.device)
   1396     is_on_gpu([g, p, state1, state2, qmap1, qmap2, absmax1, absmax2])
   1397     if g.dtype == torch.float32 and state1.dtype == torch.uint8:

File ~/playgrounds/.venv/lib/python3.12/site-packages/bitsandbytes/functional.py:416, in pre_call(device)
    415 def pre_call(device):
--> 416     prev_device = torch.cuda.current_device()
    417     torch.cuda.set_device(device)
    418     return prev_device

File ~/playgrounds/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:1026, in current_device()
   1024 def current_device() -> int:
   1025     r"""Return the index of a currently selected device."""
-> 1026     _lazy_init()
   1027     return torch._C._cuda_getDevice()

File ~/playgrounds/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:363, in _lazy_init()
    358     raise RuntimeError(
    359         "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
    360         "multiprocessing, you must use the 'spawn' start method"
    361     )
    362 if not hasattr(torch._C, "_cuda_getDeviceCount"):
--> 363     raise AssertionError("Torch not compiled with CUDA enabled")
    364 if _cudart is None:
    365     raise AssertionError(
    366         "libcudart functions unavailable. It looks like you have a broken build?"
    367     )

AssertionError: Torch not compiled with CUDA enabled