Florence-2
Unified vision foundation model by Microsoft for captioning, detection, and segmentation.
About
Florence-2 is a compact vision foundation model from Microsoft that handles many vision and vision-language tasks through a single prompt-based sequence-to-sequence interface. Given a short text prompt, it performs captioning, object detection, grounding, OCR, or segmentation, so one small model covers a broad range of tasks. A Transformers implementation is published on Hugging Face. The model is released under the MIT license.
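The prompt-based interface described above can be sketched roughly as follows. This is a minimal illustration, not an official recipe: it assumes the `microsoft/Florence-2-base` checkpoint and the task tokens shown on the Hugging Face model card, and the names `TASK_PROMPTS`, `build_prompt`, and `run_task` are hypothetical helpers introduced only for this example. Check the model card for the current loading instructions before relying on it.

```python
# Task tokens as listed on the Florence-2 model card (assumed; verify there).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detection": "<OD>",
    "grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "ocr": "<OCR>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, text=""):
    """Return the task token, optionally followed by extra text (e.g. a phrase to ground)."""
    return TASK_PROMPTS[task] + text


def run_task(image_path, task, text="", model_id="microsoft/Florence-2-base"):
    """Run one task on one image. Downloads the checkpoint on first use."""
    # Heavy imports stay local so the helpers above work without them installed.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # The checkpoint ships its own modeling code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=build_prompt(task, text), images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Convert the generated token string into task-specific structured output.
    return processor.post_process_generation(
        raw, task=TASK_PROMPTS[task], image_size=(image.width, image.height)
    )
```

Calling `run_task("photo.jpg", "detection")` with a local image (the file name is a placeholder) would send the `<OD>` prompt. Switching tasks only changes the prompt string, not the model or the calling code.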
Details
- Price: Free
- Platform: Local/Desktop
- Difficulty: Intermediate (3/5)
- License: MIT
- Minimum VRAM: 6 GB
- Added: Apr 3, 2026
Related Tools
Foundation model for monocular depth estimation by TikTok.
Improved vision-language model by Google using sigmoid loss for contrastive learning.
Open-vocabulary object detection model by Google using vision transformers.
OpenMMLab detection toolbox with 300+ pre-trained models and 80+ algorithms.
State-of-the-art real-time object detection supporting YOLOv5 through v11.
Additive angular margin loss for deep face recognition.