YOLO-World: Real-Time Open-Vocabulary Object Detection

1 Tencent AI Lab, 2 ARC Lab, Tencent PCG, 3 Huazhong University of Science and Technology
*Equal Contribution   📧 Corresponding Author   ⭐ Project Lead

CVPR 2024

🔥 What's New
  • [2024-3-18] We are excited to announce that YOLO-World has been accepted to CVPR 2024; we hope to see you in Seattle! YOLO-World now supports prompt tuning, image prompts, high-resolution images (1280x1280), and ONNX export.
  • [2024-2-18] We thank @SkalskiP for developing the wonderful segmentation demo that connects YOLO-World and EfficientSAM. You can try it now on 🤗 HuggingFace Spaces.
  • [2024-2-17] We release the code & models for YOLO-World-Seg! YOLO-World now supports open-vocabulary / zero-shot object segmentation!
  • [2024-2-10] We provide the fine-tuning and data details for fine-tuning YOLO-World on the COCO dataset or on custom datasets!
  • [2024-2-3] We now support a Gradio demo in the repo, so you can build the YOLO-World demo on your own device!
  • [2024-2-1] We have released the code and models of YOLO-World.
  • [2024-1-31] The technical report of YOLO-World is available now!

🤗 Demo

Video Guide

Thanks to @SkalskiP for contributing the video guide about YOLO-World!

📖 Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.

YOLO-World

🌟 Highlights

  • YOLO-World is the next generation of YOLO detectors, aiming for real-time open-vocabulary object detection.
  • YOLO-World is pre-trained on large-scale vision-language datasets, including Objects365, GQA, Flickr30K, and CC3M, which endows YOLO-World with strong zero-shot open-vocabulary capability and grounding ability in images.
  • YOLO-World achieves fast inference speeds, and we present re-parameterization techniques for even faster inference and deployment given a user's vocabulary (see the sketch after this list).
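
The offline-vocabulary idea in the last bullet can be made concrete with a small sketch: for a fixed user vocabulary, text embeddings are computed once with a frozen CLIP text encoder and cached, so deployment only runs the visual branch. The snippet below is a minimal sketch of that offline step using the Hugging Face `transformers` CLIP text model; the model name, output file, and embedding size are illustrative assumptions, not the official YOLO-World pipeline.

```python
# Minimal sketch (not the official YOLO-World API): pre-compute text embeddings
# for a fixed user vocabulary with a frozen CLIP text encoder, so that online
# inference only needs the visual branch.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

vocabulary = ["person", "bicycle", "red backpack"]  # example user-defined classes

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32"
).eval()

with torch.no_grad():
    tokens = tokenizer(vocabulary, padding=True, return_tensors="pt")
    text_embeds = text_encoder(**tokens).text_embeds                  # (num_classes, 512)
    text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)  # unit-norm embeddings

# The cached embeddings act as fixed, class-specific weights at deployment time,
# so the text encoder can be dropped from the inference graph.
torch.save(text_embeds, "vocabulary_embeddings.pt")
```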

⚙️ Framework

  • YOLO-World builds on the YOLO detector with a frozen CLIP-based text encoder that extracts text embeddings from the input texts, e.g., object categories or noun phrases.
  • YOLO-World contains a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) to facilitate the interaction between multi-scale image features and text embeddings. The RepVL-PAN can re-parameterize the user's offline vocabulary into the model parameters for fast inference and deployment (a simplified fusion sketch follows this list).
  • YOLO-World is pre-trained on large-scale region-text datasets with a region-text contrastive loss to learn region-level alignment between vision and language. For plain image-text datasets, e.g., CC3M, we adopt an automatic labeling approach to generate pseudo region-text pairs (a toy loss sketch appears after the note below).
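
To make the vision-language fusion concrete, below is a simplified, self-contained sketch of a text-guided gating block in the spirit of the max-sigmoid attention used in RepVL-PAN; the class name, the 1x1 projection, and the tensor shapes are assumptions for illustration and do not reproduce the released implementation.

```python
# Illustrative sketch of text-guided feature modulation (simplified; names and
# shapes are assumptions, not the released RepVL-PAN code).
import torch
import torch.nn as nn

class MaxSigmoidTextGate(nn.Module):
    def __init__(self, img_channels: int, text_dim: int):
        super().__init__()
        # project multi-scale image features into the text embedding space
        self.proj = nn.Conv2d(img_channels, text_dim, kernel_size=1)

    def forward(self, img_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W); text_embeds: (num_classes, D), L2-normalized
        proj = self.proj(img_feats)                              # (B, D, H, W)
        sim = torch.einsum("bdhw,nd->bnhw", proj, text_embeds)   # per-class similarity map
        gate = sim.max(dim=1, keepdim=True).values.sigmoid()     # (B, 1, H, W)
        return img_feats * gate                                  # re-weight visual features

feats = torch.randn(2, 256, 40, 40)
texts = torch.nn.functional.normalize(torch.randn(80, 512), dim=-1)
print(MaxSigmoidTextGate(256, 512)(feats, texts).shape)          # torch.Size([2, 256, 40, 40])
```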

Please check our technical report for more details.
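
The region-text contrastive objective can be sketched as an InfoNCE-style classification of each region embedding over the vocabulary embeddings. The helper below is a toy illustration under that assumption (the function name, temperature, and one-to-one region-text assignment are ours); the exact formulation is in the technical report.

```python
# Toy sketch of a region-text contrastive loss (a simplification for illustration).
import torch
import torch.nn.functional as F

def region_text_contrastive(region_embeds, text_embeds, labels, tau=0.07):
    """region_embeds: (num_regions, D) embeddings of assigned region proposals
    text_embeds:   (num_classes, D) embeddings of the vocabulary
    labels:        (num_regions,) index of the matched text for each region
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_embeds @ text_embeds.t() / tau  # region-to-text similarities
    return F.cross_entropy(logits, labels)

loss = region_text_contrastive(torch.randn(16, 512), torch.randn(80, 512),
                               torch.randint(0, 80, (16,)))
```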

📊 Performance

1. Zero-Shot Evaluation on LVIS

We compare the zero-shot performance of recent open-vocabulary detectors on LVIS (minival):

| Method | Backbone | Pre-trained Data | FPS (V100) | AP | APr |
|---|---|---|---|---|---|
| GLIP-T | Swin-T | O365, GoldG | 0.12 | 24.9 | 17.7 |
| GLIP-T | Swin-T | O365, GoldG, Cap4M | 0.12 | 26.0 | 20.8 |
| GLIPv2-T | Swin-T | O365, GoldG | 0.12 | 26.9 | - |
| GLIPv2-T | Swin-T | O365, GoldG, Cap4M | 0.12 | 29.0 | - |
| GroundingDINO-T | Swin-T | O365, GoldG | 1.5 | 25.6 | 14.4 |
| GroundingDINO-T | Swin-T | O365, GoldG, Cap4M | 1.5 | 27.4 | 18.1 |
| DetCLIP-T | Swin-T | O365, GoldG | 2.3 | 34.4 | 26.9 |
| YOLO-World-S | YOLOv8-S | O365, GoldG | 74.1 | 26.2 | 19.1 |
| YOLO-World-M | YOLOv8-M | O365, GoldG | 58.1 | 31.0 | 23.8 |
| YOLO-World-L | YOLOv8-L | O365, GoldG | 52.0 | 35.0 | 27.1 |
| YOLO-World-L | YOLOv8-L | O365, GoldG, CC-250K | 52.0 | 35.4 | 27.6 |

Zero-shot evaluation on LVIS minival.

2. Speed and Accuracy Curve

We compare the speed-accuracy curve of the pre-trained YOLO-World versus recent open-vocabulary detectors on zero-shot LVIS evaluation:

3. Visualizations

We provide some visualization results generated by the pre-trained YOLO-World-L:


(a) Visualization Results on Zero-shot Inference on LVIS

(b) Visualization Results on User’s Vocabulary

(c) Visualization Results on Referring Object Detection

BibTeX

If you find YOLO-World useful in your research or applications, please consider citing it.


        @article{cheng2024yolow,
          title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
          author={Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying},
          journal={arXiv preprint arXiv:},
          year={2024}
        }
  

Acknowledgement

This website is adapted from Nerfies, LLaVA, and ShareGPT4V, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.