
NVIDIA LocateAnything-3B vision-language grounding model

NVIDIA LocateAnything-3B vision-language grounding model. Covers inference API (detect/ground/point/detect_text/ground_gui), data preparation (JSONL+Recipe 8...
Source: openlark (clawhub, v1.0.0, no API key required)

Overview

LocateAnything — Vision-Language Grounding

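LocateAnything is a vision-language model in the NVIDIA Eagle family. It is built on Parallel Box Decoding (PBD), which predicts a complete set of box coordinates in a single parallel step. Reported throughput is 12.7 BPS on an H100, roughly 10× that of Qwen3-VL.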

Architecture: MoonViT-SO-400M → MLP → Qwen2.5-3B → PBD

Installation

git clone https://github.com/NVlabs/Eagle eagle && cd eagle/Embodied
pip install -e .
# Optional MagiAttention (Hopper/Blackwell only, long sequences 32K+):
# git clone https://github.com/SandAI-org/MagiAttention.git && cd MagiAttention && git checkout v1.0.5
# git submodule update --init --recursive && pip install --no-build-isolation .

Inference API

from locateanything_worker import LocateAnythingWorker
from PIL import Image

worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("e.jpg").convert("RGB")

worker.detect(img, ["person", "car"])                    # Object detection
worker.ground_single(img, "the red car")                  # Single-instance grounding
worker.ground_multi(img, "people wearing hats")           # Multi-instance grounding
worker.detect_text(img)                                   # OCR
worker.ground_gui(img, "search button")                   # GUI box
worker.ground_gui(img, "search", output_type="point")     # GUI point
worker.point(img, "the traffic light")                    # Point grounding

# Low-level: worker.predict(image, question, generation_mode="hybrid")
# mode: fast(MTP) | slow(NTP) | hybrid(default)

Output Parsing

Box: <ref>label</ref><box><x1><y1><x2><y2></box>
Point: <box><x><y></box>
Empty: <box>none</box>

Coordinates are integers in [0, 1000]; divide by 1000 to get relative coordinates.

boxes = LocateAnythingWorker.parse_boxes(answer, w, h)  # Pixel coordinates
points = LocateAnythingWorker.parse_points(answer, w, h)
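
To illustrate the output format described above, here is a standalone sketch that converts an answer string into pixel coordinates with regular expressions. It is not the library's parse_boxes or parse_points. The function name parse_answer_sketch and the returned dictionary layout are assumptions made for this example, as is the rule that a box without its own <ref> tag inherits the most recent label.

```python
import re

# Matches either a <ref>label</ref> tag or a <box>...</box> payload.
TOKEN_RE = re.compile(r"<ref>(.*?)</ref>|<box>(.*?)</box>", re.DOTALL)
# Matches one integer coordinate written as <123>.
NUM_RE = re.compile(r"<(\d+)>")


def parse_answer_sketch(answer, width, height):
    """Return a list of {"label", "kind", "coords"} dicts in pixel units.

    Hypothetical helper for illustration only; assumes coordinates are
    integers in [0, 1000] relative to the image width and height.
    """
    results = []
    label = None
    for match in TOKEN_RE.finditer(answer):
        if match.group(1) is not None:
            # A <ref> tag: remember the label for the boxes that follow.
            label = match.group(1)
            continue
        values = [int(v) for v in NUM_RE.findall(match.group(2))]
        if len(values) not in (2, 4):
            continue  # covers the empty case <box>none</box>
        # x values scale with width, y values with height.
        sizes = (width, height) * (len(values) // 2)
        coords = tuple(v * s / 1000 for v, s in zip(values, sizes))
        kind = "box" if len(values) == 4 else "point"
        results.append({"label": label, "kind": kind, "coords": coords})
    return results
```

Four values are treated as a box (x1, y1, x2, y2) and two values as a point (x, y); anything else, including the empty marker, is skipped.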

Data Preparation

JSONL (ShareGPT Format)

{"conversations":[{"from":"human","value":"Detect all objects in <image-1>."},{"from":"gpt","value":"<ref>car</ref><box><100><200><400><500></box>"}],"image":"train/00001.jpg"}
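
A record like the one above can be generated from pixel-space annotations. The sketch below is one possible way to do it, not the project's own tooling: the helper names box_to_tag and make_record are hypothetical, the human prompt is copied from the example record, and concatenating several boxes without a separator is an assumption.

```python
import json


def box_to_tag(label, box, width, height):
    """Format one pixel-space box as <ref>label</ref><box>...</box>."""
    x1, y1, x2, y2 = box
    scaled = [
        round(x1 * 1000 / width),
        round(y1 * 1000 / height),
        round(x2 * 1000 / width),
        round(y2 * 1000 / height),
    ]
    # Clamp to [0, 1000] in case a box slightly exceeds the image border.
    scaled = [min(1000, max(0, v)) for v in scaled]
    coords = "".join(f"<{v}>" for v in scaled)
    return f"<ref>{label}</ref><box>{coords}</box>"


def make_record(image_path, width, height, annotations):
    """Build one ShareGPT-style training record as a JSON string.

    annotations is a list of (label, (x1, y1, x2, y2)) pairs in pixels.
    """
    if annotations:
        # Assumption: multiple boxes are simply concatenated.
        answer = "".join(box_to_tag(l, b, width, height) for l, b in annotations)
    else:
        answer = "<box>none</box>"  # empty-result marker
    record = {
        "conversations": [
            {"from": "human", "value": "Detect all objects in <image-1>."},
            {"from": "gpt", "value": answer},
        ],
        "image": image_path,
    }
    return json.dumps(record, ensure_ascii=False)
```

Writing one make_record(...) result per line produces a JSONL file in the layout shown above.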

Recipe JSON

{"my_data":{"annotation":["a.jsonl","b.jsonl"],"root":"/data/images/","repeat_time":1.0,"data_augment":true}}

repeat_time: values above 1 oversample the dataset and values below 1 downsample it. Coordinates must be normalized to [0, 1000].
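
A recipe file can also be assembled programmatically. The following is a minimal sketch that mirrors the keys of the example recipe. The function build_recipe, the second dataset small_set with its paths, and the idea of listing several datasets as separate top-level keys are assumptions for illustration.

```python
import json


def build_recipe(name, annotation_files, image_root,
                 repeat_time=1.0, data_augment=True):
    """Return one recipe entry keyed by dataset name (hypothetical helper)."""
    # Assumption: a non-positive repeat_time is never meaningful.
    if repeat_time <= 0:
        raise ValueError("repeat_time must be positive")
    return {
        name: {
            "annotation": list(annotation_files),
            "root": image_root,
            "repeat_time": repeat_time,
            "data_augment": data_augment,
        }
    }


recipe = {}
recipe.update(build_recipe("my_data", ["a.jsonl", "b.jsonl"], "/data/images/"))
# Hypothetical second dataset, oversampled 2x.
recipe.update(build_recipe("small_set", ["c.jsonl"], "/data/small/", repeat_time=2.0))
recipe_json = json.dumps(recipe, indent=2)
```

The resulting recipe_json string can be saved to a file and passed to the training script through --meta_path.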

8 Task Prompts

| Task | Method | Prompt |
|------|--------|--------|
| Detection | detect(cats) | Locate all the instances that matches: cat1cat2. |
| Single instance | ground_single(p) | Locate a single instance that matches: phrase. |
| Multi instance | ground_multi(p) | Locate all instances that match: phrase. |
| OCR | detect_text() | Detect all the text in box format. |
| Text grounding | ground_text(p) | Please locate the text referred as phrase. |
| GUI box | ground_gui(p) | Locate the region that matches: element. |
| GUI point | ground_gui(p,pt) | Point to: element. |
| Point grounding | point(p) | Point to: target. |

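Plain-text samples: omit the image field. Multi-image samples: use image_list together with image placeholders in the text (presumably <image-1>, <image-2>, and so on, as in the JSONL example).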
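
When calling the low-level predict interface or writing training data, the prompt strings above can be kept in one place. The sketch below collects the single-phrase prompts as templates. The names PROMPT_TEMPLATES and build_prompt, and the task keys, are hypothetical. The detection prompt is left out because it joins several category names and the joining format is not spelled out in this section.

```python
# Prompt wording copied from the task list; keys are illustrative names.
PROMPT_TEMPLATES = {
    "ground_single": "Locate a single instance that matches: {}.",
    "ground_multi": "Locate all instances that match: {}.",
    "detect_text": "Detect all the text in box format.",
    "ground_text": "Please locate the text referred as {}.",
    "ground_gui_box": "Locate the region that matches: {}.",
    "ground_gui_point": "Point to: {}.",
    "point": "Point to: {}.",
}


def build_prompt(task, phrase=None):
    """Fill the template for a task; phrase is required when it has a slot."""
    template = PROMPT_TEMPLATES[task]
    if "{}" not in template:
        return template
    if not phrase:
        raise ValueError(f"task {task!r} needs a phrase")
    return template.format(phrase)
```

For example, build_prompt("ground_single", "the red car") yields the single-instance prompt with the phrase filled in.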

Training

torchrun --nproc_per_node=8 eaglevl/train/locany_finetune_magi_stream.py \
  --model_name_or_path nvidia/LocateAnything-3B \
  --meta_path "./recipe.json" --output_dir work_dirs/sft \
  --max_steps 25000 --lr 2e-5 --bf16 True --block_size 6 \
  --attn_implementation magi --max_seq_length 16384 --max_num_tokens 25600 \
  --deepspeed deepspeed_configs/zero_stage2_config.json

Key Parameters

| Parameter | Description |
|-----------|-------------|
| --block_size | MTP chunk size (default 4); use --causal_attn False during training |
| --attn_implementation | magi (Hopper/Blackwell, 32K+) or sdpa (any GPU, ~4K) |
| --freeze_llm/backbone/mlp | Freeze the corresponding modules |
| --max_num_tokens | Token budget per batch (recommended: 2-3× max_num_tokens_per_sample) |
| --packing_buffer_size | Online packing buffer (default 32; 64-128 for higher efficiency) |

On GPUs other than Hopper/Blackwell, use --attn_implementation sdpa --max_seq_length 4096. If training runs out of memory, add --grad_checkpoint True and reduce --max_num_tokens.
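
The hardware-dependent flag choice can be expressed as a small helper. This is a sketch under stated assumptions: attention_flags is a hypothetical function, and it assumes Hopper and Blackwell GPUs report a CUDA compute capability major version of 9 or higher.

```python
def attention_flags(compute_capability_major, out_of_memory=False):
    """Return extra CLI flags for the fine-tuning command.

    Assumption: major version >= 9 means Hopper or Blackwell, where
    the magi attention backend is usable; older GPUs fall back to sdpa
    with a shorter sequence length.
    """
    if compute_capability_major >= 9:
        flags = ["--attn_implementation", "magi"]
    else:
        flags = ["--attn_implementation", "sdpa", "--max_seq_length", "4096"]
    if out_of_memory:
        # Trade compute for memory; also consider lowering --max_num_tokens.
        flags += ["--grad_checkpoint", "True"]
    return flags
```

With PyTorch installed, the major version can be read from torch.cuda.get_device_capability()[0] and the returned list appended to the torchrun command.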

Streaming packing uses a Best-Fit plus Big-Rocks-First algorithm, and resuming from a checkpoint is bit-identical. ZeRO stage 2 (zero_stage2) is the recommended DeepSpeed configuration.
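
To give a feel for what best-fit packing with a largest-first ordering does, here is a toy sketch over one buffer of sample lengths. It is not the repository's implementation. It assumes "Big-Rocks-First" means placing the longest samples first, and the function name pack_best_fit is hypothetical.

```python
def pack_best_fit(lengths, max_num_tokens):
    """Group sample indices into packs whose total length fits the budget.

    Samples are visited longest first; each goes into the open pack with
    the least remaining space that still fits it (best fit), or into a
    new pack when none fits.
    """
    packs = []  # each pack: {"free": remaining tokens, "items": indices}
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        size = lengths[idx]
        if size > max_num_tokens:
            raise ValueError(f"sample {idx} exceeds the token budget")
        best = None
        for pack in packs:
            if pack["free"] >= size and (best is None or pack["free"] < best["free"]):
                best = pack
        if best is None:
            best = {"free": max_num_tokens, "items": []}
            packs.append(best)
        best["items"].append(idx)
        best["free"] -= size
    return [pack["items"] for pack in packs]


# Five samples with a budget of 10 tokens fill two packs completely.
print(pack_best_fit([6, 3, 5, 4, 2], 10))  # [[0, 3], [2, 1, 4]]
```

A larger buffer gives the packer more candidates to combine, which is the intuition behind raising --packing_buffer_size for higher efficiency.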

Evaluation

# COCO / LVIS
bash evaluation/scripts/eval_coco.sh --model_path ... --test_jsonl ... --coco_json ... --output_dir ...
bash evaluation/scripts/eval_lvis.sh --model_path ... --test_jsonl ... --lvis_json ... --output_dir ...

# General grounding (Dense200/DocLayNet/HumanRef/RefCOCOg/VisDrone etc.)
bash evaluation/scripts/eval_grounding.sh --dataset Dense200 --eval_type box_eval ...

# Point evaluation / ScreenSpot-Pro
bash evaluation/scripts/eval_grounding.sh --dataset COCO --eval_type point_eval ...
bash evaluation/scripts/eval_sspro.sh --model_path ... --test_jsonl ... --output_dir ...

Evaluation requires Rex-Omni fastevaluate plus the datasets Mountchicken/Rex-Omni-EvalData and likaixin/ScreenSpot-Pro.

Key Results

| Benchmark | Score | Comparison |
|-----------|:-----:|------------|
| LVIS F1@Mean | 50.7 | +3.8 vs Rex-Omni |
| COCO F1@Mean | 54.7 | +1.8 vs Rex-Omni |
| M6Doc F1@Mean | 70.1 | +14.5 vs Rex-Omni |
| ScreenSpot-Pro Avg | 60.3 | SOTA |
| RefCOCOg val F1@Mean | 76.7 | SOTA |
| Pointing (7 benchmarks) | Best on all | |
| PBD dense scenes | 2-6× faster | vs NTP |

Model Info

  • Name: nvidia/LocateAnything-3B | LLM: Qwen2.5-3B | Vision: MoonViT-SO-400M | Length: 25K
  • HF: https://huggingface.co/nvidia/LocateAnything-3B
  • Demo: https://huggingface.co/spaces/nvidia/LocateAnything
  • Paper: arXiv:2605.27365

License

Code: Apache 2.0 | Model: NVIDIA License (non-commercial research)

References

  • GitHub: https://github.com/NVlabs/Eagle
  • Project page: https://nvlabs.github.io/Eagle/

Version History

1 version in total

  • v1.0.0 (current)
    2026-06-03 13:43, security status: safe
