Introducing Llama Vision-Instruct Models with DigitalOcean 1-Click GPU Droplets

Meta Llama 3.2 Vision Instruct models are now available as 1-Click Models on DigitalOcean GPU Droplets, thanks to a collaboration with Hugging Face. These models extend the capabilities of the Llama Large Language Models to the visual modality, enabling them to reason about and describe image data with textual outputs. Running on DigitalOcean GPU Droplets, we can take advantage of these models' capabilities on NVIDIA GPUs at faster speeds than ever.
DigitalOcean is a cloud infrastructure-as-a-service (IaaS) provider that offers simple, affordable, and scalable cloud computing solutions for developers, startups, and small businesses, with a focus on ease of use and rapid deployment. With 1-Click Model GPU Droplets, users can deploy powerful LLMs at scale on reliable cloud infrastructure, at affordable prices and with no setup involved.
Follow along for an introduction to vision LLMs, a look at the strengths of Llama 3.2 Vision, and a demonstration of how to get started with the Llama 3.2 Vision models on a 1-Click GPU Droplet using doctl, the official command line interface for the DigitalOcean API.
What are Vision-Instruct LLMs, and how do they work with 1-Click GPU Droplets?
Vision-Instruct LLMs are, in short, LLMs that can interact with both text and image data, using their understanding of both to generate meaningful outputs. As the authors at Meta put it, these models excel at “visual recognition, image reasoning, captioning, and answering general questions about an image.” (Source: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)
With 1-Click Model GPU Droplets, these models are automatically deployed and configured without any setup on the user's part. Once the GPU Droplet is spun up, users can interact with the deployed model using cURL, the Python requests library, or the OpenAI client syntax, with both text and image data. Because the Droplet has internet connectivity, image data can be sourced directly from the web by URL.
Learn more about the newly available Vision-Instruct models on GPU Droplets at the marketplace:
- Llama 3.2 11B Vision Instruct - Single GPU Deployment
- Llama 3.2 11B Vision Instruct - Multi GPU Deployment
- Llama 3.2 90B Vision Instruct - Multi GPU Deployment
The Llama 3.2 Vision-Instruct models
Released in late 2024, these models are the second most recent iteration of the Llama series of Large Language Models. This release was the first in the Llama family to handle image data alongside text inputs. Let's take a look at the model family and compare the two available sizes.
| Model | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2-Vision | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
| Llama 3.2-Vision | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
As we can see from the table above, two versions of the model were released, with 11B and 90B parameters respectively. Each handles the same 128k-token context length and uses grouped query attention (GQA) to accelerate inference. Both have a knowledge cutoff of December 2023.
Use cases for Vision-Instruct models
There are countless use cases for an LLM that can handle both textual and visual data. With fine-tuning, these models can be optimized further for better performance, sharper acuity, and a more relevant understanding of specific contexts. Let’s explore some of the possibilities.
- Image captioning: these models can derive a textual description of image data. They can also be used to iteratively caption images at scale, as sketched in the example after this list
- Visual Question Answering (VQA): the models can answer questions about images, combining a higher-order understanding of the image with the model's inherent reasoning capabilities
- Object Recognition and Classification: the model can identify individual objects depicted in the image, and classify them by object type. It can do this without any additional training
- Spatial reasoning: the ability to understand and describe the relative positioning of objects in an image
- Document understanding: the ability to read and understand documents as images, such as PDFs or rich text files. This is paired with the native comprehension capability of the LLM to provide analysis of the contents
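To make the captioning use case concrete, here is a minimal sketch of captioning a batch of images by URL. It assumes a deployed 1-Click endpoint reachable at localhost:8080 with a BEARER_TOKEN environment variable set (as in the cURL walkthrough later in this guide); the image URLs are placeholders for your own sources.
# Hypothetical list of image URLs to caption; replace with your own
for url in \
  "https://example.com/image-1.jpg" \
  "https://example.com/image-2.jpg"; do
  # Ask the deployed model for a one-sentence caption of each image
  curl -s http://localhost:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $BEARER_TOKEN" \
    -d '{
      "messages": [{
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "'"$url"'"}},
          {"type": "text", "text": "Write a one-sentence caption for this image."}
        ]
      }],
      "max_tokens": 64
    }'
  echo
done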
How to Create a Vision Model 1-Click GPU Droplet with doctl
- Download and install doctl
doctl is the official command line interface for the DigitalOcean API, allowing users to manage DigitalOcean products from their local terminal. We will use doctl to create the 1-Click Model GPU Droplet.
Installing doctl is simple. Just follow the guidance provided in the official documentation.
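As a quick reference, two common install paths from the documentation look like this; pick the one that matches your platform.
# macOS, via Homebrew
brew install doctl

# Ubuntu, via Snap
sudo snap install doctl

# Confirm the install worked
doctl version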
- Log in to your account with doctl
To authorize doctl for your DigitalOcean account, we first need to generate an API key. To do so, open the DigitalOcean cloud console in your browser. In the left menu, click API, which takes you to the Applications & API page on the Tokens tab. In the Personal access tokens section, click the Generate New Token button, then copy the resulting API key to your clipboard.
After that is done, we need to run the authorization command. Replace <name> with whatever name you would like to use for the team you are authorizing access to:
doctl auth init --context <name>
Once that is done, you will be prompted to paste the API key we saved earlier. That will finalize your authorization to your account.
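Optionally, you can verify the authorization before moving on:
# List the authentication contexts doctl knows about
doctl auth list

# Show details for the account tied to the current context
doctl account get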
- Create a 1-Click model GPU Droplet with doctl
Now that we have authorized our account, we can create our GPU Droplet using doctl. This is very simple. All we need to do is make sure that we have created an SSH key for the connection to the remote server. You can do this by following the guide here. Once you have done that, save the name of your key. Copy the following command with your SSH key name in place of <ssh-key-name> below, and paste it into the terminal.
doctl compute droplet create test-droplet --image 172179971 --region nyc2 --size gpu-h100x1-80gb --ssh-keys <ssh-key-name>
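If you need to look up your SSH key, or want to check on the Droplet after creation, these commands help (test-droplet is the name used in the create command above):
# List the SSH keys registered on your account
doctl compute ssh-key list

# Check your Droplets' status and public IP addresses
doctl compute droplet list --format ID,Name,PublicIPv4,Status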
- Interact with your 1-Click model with cURL
When interacting with the model from our terminal, we can use cURL, Python requests, or OpenAI's Python syntax. Learn more about the different ways to interact with the deployed model here. For this demonstration, we are going to use cURL.
curl http://localhost:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $BEARER_TOKEN" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          },
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          }
        ]
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 128
  }'
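Once the request above works, a small variation like the following can make the output easier to read. This is only a sketch: it assumes jq is installed on the machine making the request and that the endpoint returns OpenAI-style chat completion JSON, as the request format above implies.
# Send a simple text-only request and pull out just the generated message
RESPONSE=$(curl -s http://localhost:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $BEARER_TOKEN" \
  -d '{
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Say hello in one sentence."}]}],
    "max_tokens": 32
  }')

echo "$RESPONSE" | jq -r '.choices[0].message.content'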
Closing Thoughts
The potential of vision instruct models is truly limitless. Together, DigitalOcean and Hugging Face abstract away the complexity so that you can focus on building. With the simplicity of this solution and the depth of the platforms behind it, you can deploy Llama Vision Instruct models in a matter of minutes and get on with building your AI applications. We encourage you to try out the 1-Click Model GPU Droplets on DigitalOcean!
Be sure to explore the DigitalOcean organization for more information about the available models on GPU Droplets!