Vision

Vision-capable models accept image inputs alongside text. ai& supports three input modes; choose the one that fits your use case.

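Three ways to send an image

An image can be passed by public URL, as an inline base64 data URL, or as a file_id reference to a previously uploaded file. The three user messages below show each mode in that order.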

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/cat.png" } }
  ]
}

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KG..." } }
  ]
}

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    { "type": "file", "file": { "file_id": "file-abc123" } }
  ]
}
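
For the inline mode, the data URL in the second message can be produced from raw image bytes. The following is a minimal sketch using only the Python standard library; the helper names to_data_url and user_message_with_data_url are illustrative and not part of any ai& SDK.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def user_message_with_data_url(prompt: str, image_bytes: bytes,
                               mime_type: str = "image/png") -> dict:
    """Build a user message shaped like the inline example above."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": to_data_url(image_bytes, mime_type)}},
        ],
    }
```

In practice the bytes would typically come from reading a local file in binary mode, with mime_type set to match the file's actual format.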

The file_id path

When you send {type: "file", file: {file_id}}, the model receives the bytes you uploaded, so no public hosting is required. The wire format on your end is the same in every mode: a user message whose content array holds a text part and an image part. Only the image part differs.
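
One way to see that only the image part changes is a small builder that produces either part shape and places it in the same message structure. This is a sketch under the assumption that the part shapes are exactly those shown in the examples above; image_part and user_message are hypothetical helper names.

```python
from typing import Optional


def image_part(url: Optional[str] = None, file_id: Optional[str] = None) -> dict:
    """Return an image content part for a URL (or data URL) or a file_id."""
    if (url is None) == (file_id is None):
        raise ValueError("pass exactly one of url or file_id")
    if file_id is not None:
        # Reference to a previously uploaded file.
        return {"type": "file", "file": {"file_id": file_id}}
    # Public URLs and base64 data URLs share the image_url part shape.
    return {"type": "image_url", "image_url": {"url": url}}


def user_message(prompt: str, part: dict) -> dict:
    """Wrap a text prompt and one image part in a user message."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, part],
    }
```

For example, user_message("What's in this image?", image_part(file_id="file-abc123")) yields the third message shown earlier.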

Upload images via Files with purpose: "vision", or omit the purpose and let it be inferred from the file's MIME type.

Supported MIME types: image/png, image/jpeg, image/webp, image/gif. Max size: 100 MB.
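
A client can check these limits before uploading. The sketch below detects the format from the file's leading bytes (the standard PNG, JPEG, GIF, and WebP signatures) and compares the size against the limit. Two assumptions: "100 MB" is read as 100,000,000 bytes, the more conservative interpretation, and the function names are illustrative. The service may validate differently.

```python
from typing import Optional

SUPPORTED_MIME_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}
# Assumption: decimal megabytes. The documented limit may be binary (MiB).
MAX_IMAGE_BYTES = 100 * 1000 * 1000


def sniff_image_mime(data: bytes) -> Optional[str]:
    """Guess the MIME type from well-known file signatures."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def check_image(data: bytes) -> str:
    """Return the detected MIME type, or raise ValueError if unusable."""
    mime = sniff_image_mime(data)
    if mime not in SUPPORTED_MIME_TYPES:
        raise ValueError("unsupported or unrecognized image format")
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the assumed 100 MB limit")
    return mime
```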

Capability gating

Vision requests are rejected if the target model lacks the vision capability — both for inline image_url parts and for file_id references whose purpose maps to vision. List models and their capabilities via Models.
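
The same gate can be approximated client-side to fail early. This sketch assumes a model record carries a "capabilities" list containing the string "vision"; that shape is hypothetical, as the actual Models response format is not described here. Because a file part only counts as an image when its purpose maps to vision, the caller supplies the set of file IDs known to be vision uploads.

```python
from typing import AbstractSet, Iterable


def supports_vision(model: dict) -> bool:
    # Assumed record shape: {"id": "...", "capabilities": ["vision", ...]}
    return "vision" in model.get("capabilities", [])


def needs_vision(messages: Iterable[dict],
                 vision_file_ids: AbstractSet[str] = frozenset()) -> bool:
    """True if any message contains an image_url part or a vision file part."""
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image parts
        for part in content:
            if part.get("type") == "image_url":
                return True
            if (part.get("type") == "file"
                    and part["file"]["file_id"] in vision_file_ids):
                return True
    return False


def check_vision_gate(model: dict, messages: Iterable[dict],
                      vision_file_ids: AbstractSet[str] = frozenset()) -> None:
    """Raise ValueError when images are sent to a model without vision."""
    if needs_vision(messages, vision_file_ids) and not supports_vision(model):
        raise ValueError("target model lacks the vision capability")
```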