Image-to-prompt functionality, the ability to analyze an image and generate an AI-ready text prompt, has gone from a novelty to a core feature in dozens of creative tools. Design apps use it for style matching. Browser extensions use it to analyze whatever image the user is viewing. Content pipelines use it to batch-process visual assets.

If you want to add this capability to your own application, there are three serious options: Claude Vision (Anthropic), GPT-4V (OpenAI), and Gemini Vision (Google). This guide covers the technical integration for each, with working code and prompt templates that produce high-quality, structured output.

Why Developers Need an Image-to-Prompt API

The use cases are broader than you might first expect.

The Three Main API Options

| API | Provider | Image input | Price (approx.) | Best for |
|---|---|---|---|---|
| Claude Vision (claude-haiku-4-5 / claude-sonnet) | Anthropic | Base64 or URL | $0.25–$3 / 1M input tokens | Detailed prompts, instruction following |
| GPT-4V / GPT-4o | OpenAI | Base64 or URL | $2.50–$10 / 1M input tokens | Broad knowledge, conversational integration |
| Gemini Vision (1.5 Flash/Pro) | Google | Base64, URL, or GCS | $0.075–$3.50 / 1M input tokens | High throughput, cost-sensitive workloads |

For image-to-prompt specifically, Claude Vision tends to produce the most structured, art-direction-aware output. GPT-4o is the most versatile choice if you are already in the OpenAI ecosystem. Gemini 1.5 Flash is the cheapest option for high-throughput pipelines where cost matters more than quality.

Claude Vision API: Setup and Authentication

First, install the Anthropic SDK:

pip install anthropic

Or, for Node.js:

npm install @anthropic-ai/sdk

Get an API key from console.anthropic.com and set it as an environment variable:

export ANTHROPIC_API_KEY="sk-ant-api03-..."

Image Encoding: Base64 vs URL

Claude Vision accepts images in two formats:

Base64 (recommended for local files): encode the image as a Base64 string and include it directly in the API request. JPEG, PNG, GIF, and WebP are supported, with a 5MB limit per image.

URL (recommended for web-hosted images): pass a publicly accessible image URL and Claude fetches the image server-side. The URL must be reachable without authentication.

Performance tip: for production workloads, Base64 gives more predictable latency because you are not dependent on the speed or availability of a downstream URL. For prototyping, URLs are simpler.
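The two options correspond to two `source` shapes in the request body, the same structures used in the full implementations later in this guide. A minimal sketch (the `makeImageBlock` helper is illustrative, not part of the SDK):

```javascript
// Minimal sketch of the two `source` shapes a Claude Vision image block can take.
// The makeImageBlock helper is illustrative, not part of the SDK.
function makeImageBlock(source) {
  return { type: "image", source };
}

// Base64: the encoded bytes travel inside the request itself.
const fromBase64 = makeImageBlock({
  type: "base64",
  media_type: "image/png", // or image/jpeg, image/gif, image/webp
  data: "<base64-encoded bytes>",
});

// URL: Claude fetches the image server-side; it must be publicly accessible.
const fromUrl = makeImageBlock({
  type: "url",
  url: "https://example.com/photo.jpg",
});
```

Either block can be placed in the `content` array of a user message, alongside the text instruction.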

A System Prompt Template for Generating AI-Ready Prompts

The system prompt is the most important part of your implementation. A weak system prompt produces generic image descriptions; a carefully designed one produces structured, actionable prompts that work well with image generation models.

Here is a system prompt template that produces high-quality Midjourney/Stable Diffusion-compatible output:

You are an expert AI art prompt engineer. When given an image, analyze it thoroughly and generate an optimized text prompt that would recreate a similar image using an AI image generation tool.

Your output must follow this structure:

PROMPT:
[Single paragraph, comma-separated descriptors. Include: subject description, environment/setting, lighting conditions, color palette, mood/atmosphere, art style, rendering quality. Be specific and concrete. 60-120 words.]

STYLE NOTES:
[2-3 sentences describing the visual style, medium, and any distinctive aesthetic characteristics]

NEGATIVE PROMPT:
[Comma-separated list of elements to exclude for best results]

MODEL RECOMMENDATION:
[One of: Midjourney, Stable Diffusion, DALL·E 3, Flux, Ideogram — based on the image style]

Do not include any preamble, explanation, or text outside this structure.

Python Implementation: Claude Vision

import anthropic
import base64
from pathlib import Path

def image_to_prompt(image_path: str, target_model: str = "Midjourney") -> dict:
    """
    Analyze an image and generate an AI art prompt.

    Args:
        image_path: Path to local image file
        target_model: Target AI model ("Midjourney", "Stable Diffusion", etc.)

    Returns:
        dict with 'prompt', 'style_notes', 'negative_prompt', 'model_recommendation'
    """
    client = anthropic.Anthropic()

    # Read and encode image
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Detect media type
    suffix = Path(image_path).suffix.lower()
    media_type_map = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }
    media_type = media_type_map.get(suffix, "image/jpeg")

    system_prompt = """You are an expert AI art prompt engineer. Analyze the provided image and generate an optimized prompt for AI image generation.

Output this exact structure:

PROMPT:
[60-120 word comma-separated prompt covering subject, environment, lighting, colors, mood, style, quality tokens]

STYLE NOTES:
[2-3 sentences on visual style and distinctive characteristics]

NEGATIVE PROMPT:
[Comma-separated exclusion list]

MODEL RECOMMENDATION:
[Best model: Midjourney, Stable Diffusion, DALL·E 3, Flux, or Ideogram]"""

    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": f"Analyze this image and generate an AI art prompt optimized for {target_model}."
                    }
                ],
            }
        ],
    )

    # Parse response
    response_text = message.content[0].text
    return parse_prompt_response(response_text)


def parse_prompt_response(text: str) -> dict:
    """Parse structured response into a dictionary."""
    result = {}
    sections = {
        "prompt": "PROMPT:",
        "style_notes": "STYLE NOTES:",
        "negative_prompt": "NEGATIVE PROMPT:",
        "model_recommendation": "MODEL RECOMMENDATION:"
    }

    lines = text.split("\n")
    current_section = None
    current_content = []

    for line in lines:
        line = line.strip()
        matched = False
        for key, header in sections.items():
            if line.startswith(header):
                if current_section:
                    result[current_section] = " ".join(current_content).strip()
                current_section = key
                remainder = line[len(header):].strip()
                current_content = [remainder] if remainder else []
                matched = True
                break
        if not matched and current_section and line:
            current_content.append(line)

    if current_section:
        result[current_section] = " ".join(current_content).strip()

    return result


# Example usage
if __name__ == "__main__":
    result = image_to_prompt("reference_photo.jpg", target_model="Midjourney")
    print("PROMPT:", result.get("prompt"))
    print("NEGATIVE:", result.get("negative_prompt"))
    print("RECOMMENDED MODEL:", result.get("model_recommendation"))

JavaScript / Node.js Implementation

import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";
import path from "path";

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const SYSTEM_PROMPT = `You are an expert AI art prompt engineer. Analyze the provided image and generate an optimized prompt for AI image generation.

Output this exact structure:

PROMPT:
[60-120 word comma-separated prompt covering subject, environment, lighting, colors, mood, style, quality tokens]

STYLE NOTES:
[2-3 sentences on visual style and distinctive characteristics]

NEGATIVE PROMPT:
[Comma-separated exclusion list]

MODEL RECOMMENDATION:
[Best model: Midjourney, Stable Diffusion, DALL·E 3, Flux, or Ideogram]`;

async function imageToPrompt(imagePath, targetModel = "Midjourney") {
  const imageBuffer = fs.readFileSync(imagePath);
  const base64Image = imageBuffer.toString("base64");

  const ext = path.extname(imagePath).toLowerCase();
  const mediaTypeMap = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
  };
  const mediaType = mediaTypeMap[ext] || "image/jpeg";

  const message = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 1024,
    system: SYSTEM_PROMPT,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: mediaType,
              data: base64Image,
            },
          },
          {
            type: "text",
            text: `Analyze this image and generate an AI art prompt optimized for ${targetModel}.`,
          },
        ],
      },
    ],
  });

  const responseText = message.content[0].text;
  return parsePromptResponse(responseText);
}

function parsePromptResponse(text) {
  const result = {};
  const sections = {
    prompt: "PROMPT:",
    style_notes: "STYLE NOTES:",
    negative_prompt: "NEGATIVE PROMPT:",
    model_recommendation: "MODEL RECOMMENDATION:",
  };

  let currentSection = null;
  let currentLines = [];

  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    let matched = false;

    for (const [key, header] of Object.entries(sections)) {
      if (line.startsWith(header)) {
        if (currentSection) {
          result[currentSection] = currentLines.join(" ").trim();
        }
        currentSection = key;
        const remainder = line.slice(header.length).trim();
        currentLines = remainder ? [remainder] : [];
        matched = true;
        break;
      }
    }

    if (!matched && currentSection && line) {
      currentLines.push(line);
    }
  }

  if (currentSection) {
    result[currentSection] = currentLines.join(" ").trim();
  }

  return result;
}

// Example usage
const result = await imageToPrompt("./reference.png", "Stable Diffusion");
console.log("Prompt:", result.prompt);
console.log("Negative:", result.negative_prompt);

Using a URL Instead of Base64

async function imageToPromptFromUrl(imageUrl, targetModel = "Midjourney") {
  const message = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 1024,
    system: SYSTEM_PROMPT,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "url",
              url: imageUrl,
            },
          },
          {
            type: "text",
            text: `Analyze this image and generate an AI art prompt optimized for ${targetModel}.`,
          },
        ],
      },
    ],
  });

  return parsePromptResponse(message.content[0].text);
}

Error Handling: Rate Limits, Image Size Limits, and Unsupported Formats

A production implementation must handle these failure scenarios gracefully:

async function imageToPromptSafe(imagePath, targetModel = "Midjourney") {
  // Check file size before upload (5MB limit)
  const stats = fs.statSync(imagePath);
  const fileSizeMB = stats.size / (1024 * 1024);
  if (fileSizeMB > 5) {
    throw new Error(`Image too large: ${fileSizeMB.toFixed(1)}MB. Max 5MB.`);
  }

  // Check format
  const ext = path.extname(imagePath).toLowerCase();
  const supported = [".jpg", ".jpeg", ".png", ".gif", ".webp"];
  if (!supported.includes(ext)) {
    throw new Error(`Unsupported format: ${ext}. Use JPG, PNG, GIF, or WebP.`);
  }

  try {
    return await imageToPrompt(imagePath, targetModel);
  } catch (error) {
    if (error.status === 429) {
      // Rate limited — implement exponential backoff
      const retryAfter = parseInt(error.headers?.["retry-after"] || "60");
      console.warn(`Rate limited. Retry after ${retryAfter}s`);
      await new Promise(resolve => setTimeout(resolve, retryAfter * 1000));
      return await imageToPrompt(imagePath, targetModel); // single retry
    }
    if (error.status === 400) {
      throw new Error(`Bad request — check image format and size: ${error.message}`);
    }
    throw error;
  }
}
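The single inline retry above works for occasional 429s, but repeated rate limits or transient 5xx errors are better handled with a general backoff wrapper. A sketch using only standard JavaScript (the helper name and default values are illustrative, not from any SDK):

```javascript
// Generic retry helper: retries fn on 429 or 5xx with exponential backoff plus jitter.
// maxAttempts and baseDelayMs are illustrative defaults; tune them for your workload.
async function retryWithBackoff(fn, { maxAttempts = 4, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const retryable =
        error.status === 429 || (error.status >= 500 && error.status < 600);
      if (!retryable || attempt === maxAttempts - 1) throw error;
      // Full jitter: sleep a random duration in [0, baseDelayMs * 2^attempt)
      const delay = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Wrap any call site, for example `await retryWithBackoff(() => imageToPrompt(imagePath, targetModel))`. Jitter matters in batch jobs: without it, all workers that hit a rate limit retry at the same instant and collide again.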

[Figure: the ImageToPrompt web interface (upload area, model selection, and generated prompt output), built on the same Claude Vision API approach described in this guide.]

Cost Comparison: Claude vs GPT-4V vs Gemini Vision

Image tokens are counted differently by each provider. Here is a practical comparison for a typical 1024x1024 JPEG (~300KB):

| API | Image input tokens | Output tokens | Cost per 1,000 images |
|---|---|---|---|
| Claude Haiku 4.5 | ~1,600 input tokens | ~300 output tokens | ~$0.40 |
| Claude Sonnet 3.5 | ~1,600 input tokens | ~300 output tokens | ~$3.60 |
| GPT-4o mini | ~765 input tokens (low detail) | ~300 output tokens | ~$0.23 |
| GPT-4o | ~765 input tokens | ~300 output tokens | ~$2.70 |
| Gemini 1.5 Flash | ~258 input tokens | ~300 output tokens | ~$0.08 |
| Gemini 1.5 Pro | ~258 input tokens | ~300 output tokens | ~$0.90 |

Prices are estimates as of March 2026. Check each provider's site for current pricing.

For consumer apps processing hundreds of images per day, Gemini Flash or GPT-4o mini is the most cost-effective. For quality-critical pipelines where prompt accuracy matters most, Claude Haiku offers the best quality-to-cost ratio.
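To sanity-check a budget before committing to a provider, the per-1,000-image figures from the table above can be folded into a small estimator. The numbers are the same rough March 2026 estimates, so treat the output as ballpark only:

```javascript
// Approximate cost per 1,000 images, taken from the comparison table above.
// These are rough estimates, not live pricing.
const COST_PER_1K_IMAGES = {
  "claude-haiku-4-5": 0.40,
  "claude-sonnet-3.5": 3.60,
  "gpt-4o-mini": 0.23,
  "gpt-4o": 2.70,
  "gemini-1.5-flash": 0.08,
  "gemini-1.5-pro": 0.90,
};

// Rough monthly spend for a steady daily volume. Ballpark only.
function estimateMonthlyCost(model, imagesPerDay, daysPerMonth = 30) {
  const per1k = COST_PER_1K_IMAGES[model];
  if (per1k === undefined) throw new Error(`Unknown model: ${model}`);
  return (imagesPerDay * daysPerMonth * per1k) / 1000;
}
```

At 1,000 images per day, `estimateMonthlyCost("gemini-1.5-flash", 1000)` works out to roughly $2.40/month, versus roughly $12/month for Claude Haiku, which is the scale of difference that decides provider choice for high-volume pipelines.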

Caching Strategies to Reduce API Costs

For production deployments, caching is essential: the same image should never be analyzed twice.

import crypto from "crypto";
import fs from "fs";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);
const CACHE_TTL = 60 * 60 * 24 * 30; // 30 days

async function imageToPromptCached(imagePath, targetModel = "Midjourney") {
  // Create cache key from image content hash + model
  const imageBuffer = fs.readFileSync(imagePath);
  const imageHash = crypto
    .createHash("sha256")
    .update(imageBuffer)
    .digest("hex");
  const cacheKey = `img2prompt:${imageHash}:${targetModel}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Generate prompt (reuses the path-based imageToPrompt defined earlier)
  const result = await imageToPrompt(imagePath, targetModel);

  // Cache result
  await redis.setex(cacheKey, CACHE_TTL, JSON.stringify(result));

  return result;
}

Production Architecture: The Serverless Function Pattern

For web apps, the most practical production pattern is a serverless function that accepts image data and returns a prompt. This is the pattern ImageToPrompt itself uses, deployed on Vercel:

// api/analyze.ts (Vercel serverless function)
import type { VercelRequest, VercelResponse } from "@vercel/node";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// SYSTEM_PROMPT and parsePromptResponse are the same helpers defined earlier
// in this guide; import them from a shared module in a real project.

export default async function handler(req: VercelRequest, res: VercelResponse) {
  if (req.method !== "POST") {
    return res.status(405).json({ error: "Method not allowed" });
  }

  // CORS
  res.setHeader("Access-Control-Allow-Origin", process.env.ALLOWED_ORIGIN || "*");

  const { image, mediaType, targetModel } = req.body;

  if (!image || !mediaType) {
    return res.status(400).json({ error: "image and mediaType are required" });
  }

  try {
    const message = await client.messages.create({
      model: "claude-haiku-4-5",
      max_tokens: 1024,
      system: SYSTEM_PROMPT,
      messages: [
        {
          role: "user",
          content: [
            {
              type: "image",
              source: { type: "base64", media_type: mediaType, data: image },
            },
            {
              type: "text",
              text: `Generate an AI art prompt optimized for ${targetModel || "Midjourney"}.`,
            },
          ],
        },
      ],
    });

    const parsed = parsePromptResponse(message.content[0].text);
    return res.status(200).json(parsed);
  } catch (error: unknown) {
    const err = error as { status?: number; message?: string };
    if (err.status === 429) {
      return res.status(429).json({ error: "Rate limit exceeded" });
    }
    return res.status(500).json({ error: "Analysis failed" });
  }
}

Real-World Use Cases

Batch-Processing a Directory of Images

import { glob } from "glob";
import { writeFileSync } from "fs";

async function batchProcess(directory, targetModel = "Midjourney") {
  const images = await glob(`${directory}/**/*.{jpg,jpeg,png,webp}`);
  const results = [];

  for (const imagePath of images) {
    try {
      console.log(`Processing: ${imagePath}`);
      const result = await imageToPromptSafe(imagePath, targetModel);
      results.push({ file: imagePath, ...result });

      // Respectful rate limiting: 1 request per second
      await new Promise(resolve => setTimeout(resolve, 1000));
    } catch (error) {
      console.error(`Failed: ${imagePath}`, error.message);
      results.push({ file: imagePath, error: error.message });
    }
  }

  writeFileSync("prompts.json", JSON.stringify(results, null, 2));
  console.log(`Processed ${results.length} images`);
}

batchProcess("./assets/reference-images", "Stable Diffusion");

The Browser Extension Pattern

For a browser extension that analyzes the image a user is viewing, you grab the image from the page, convert it to Base64 in the content script, send it to your backend serverless function (to keep the API key server-side), and display the returned prompt in the popup.

Key caveat: never expose your Anthropic API key in client-side JavaScript. Always proxy through a backend function that validates requests before forwarding them to the Anthropic API.
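Under those constraints, the extension's client-side job reduces to packaging a request for the backend. A sketch of the payload builder (the function name is illustrative; the field names match the api/analyze.ts handler shown earlier):

```javascript
// Build the JSON body the serverless endpoint expects: { image, mediaType, targetModel }.
// Validating client-side too means obviously bad requests never leave the browser.
const SUPPORTED_MEDIA_TYPES = ["image/jpeg", "image/png", "image/gif", "image/webp"];

function buildAnalyzePayload(base64Image, mediaType, targetModel = "Midjourney") {
  if (!base64Image) throw new Error("image data is required");
  if (!SUPPORTED_MEDIA_TYPES.includes(mediaType)) {
    throw new Error(`Unsupported media type: ${mediaType}`);
  }
  return { image: base64Image, mediaType, targetModel };
}

// In the content script (browser-only APIs, shown for context):
//   const res = await fetch("https://your-app.example.com/api/analyze", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(buildAnalyzePayload(base64, "image/png")),
//   });
```

The server still re-validates everything, since anyone can call the endpoint directly; the client-side check just saves a round trip.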

ImageToPrompt as a Reference Implementation

ImageToPrompt (this site's tool) is a working reference implementation of this pattern. It uses Claude Haiku for image analysis, Vercel serverless functions as the backend, Upstash Redis for rate limiting (10 requests per IP per day on the free tier), and a React frontend for image upload, preview, and prompt display.

The complete architecture (serverless functions, rate limiting, CORS handling, error responses) represents a production-ready pattern you can adapt to your own use case. If you are building something similar, studying how a working implementation handles edge cases (large files, unsupported formats, concurrent requests) is worth more than any tutorial.
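The rate-limiting piece in particular is easy to reproduce. A fixed-window sketch that works against any Redis-like client (only incr and expire are assumed, which both ioredis and the Upstash client provide; the defaults mirror the free tier described above):

```javascript
// Fixed-window rate limiter: allow `limit` requests per `windowSeconds` per key.
// `store` can be any client exposing incr(key) and expire(key, seconds),
// e.g. ioredis or @upstash/redis.
async function checkRateLimit(store, ip, { limit = 10, windowSeconds = 86400 } = {}) {
  // All requests inside the same window share one counter key.
  const windowId = Math.floor(Date.now() / (windowSeconds * 1000));
  const key = `ratelimit:${ip}:${windowId}`;
  const count = await store.incr(key);
  if (count === 1) {
    // First hit in this window: expire the key so stale counters clean themselves up.
    await store.expire(key, windowSeconds);
  }
  return count <= limit;
}
```

In the serverless handler, call this before hitting the Anthropic API and return a 429 response when it comes back false. A fixed window allows brief bursts at window boundaries; a sliding-window or token-bucket scheme is smoother but needs more state.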

Frequently Asked Questions

Which API is best for image-to-prompt?

For image-to-prompt, Claude Vision usually produces the most structured, art-direction-aware output. GPT-4o is the most versatile within the OpenAI ecosystem. Gemini 1.5 Flash is the cheapest choice for high-throughput pipelines.

How much does an image-to-prompt API call cost per image?

Costs vary by provider. Claude Haiku runs about $0.40 per 1,000 images, GPT-4o mini about $0.23, and Gemini 1.5 Flash is cheapest at about $0.08. These are estimates for a typical 1024x1024 JPEG.

Should I send images to the API as Base64 or as URLs?

Base64 is recommended for production because latency is more predictable. URLs are better for prototyping and quick tests. Either way, the limit is 5MB per image.

How can I reduce image-to-prompt API costs?

Cache results keyed by a hash of the image content so the same image is never analyzed twice. For high-throughput workloads, choose Gemini Flash or GPT-4o mini; when quality comes first, Claude Haiku offers the best value for money.