AI glossary for content assistants

Plain-English definitions of 13,917 AI terms for branded assistant teams.

Glossary

13,917 terms. Open one for definitions and related concepts.

Multimodal Learning

Multimodal learning is the field of training AI models to understand and relate information from multiple modalities like text, images, and audio simultaneously.

Multimodal Fusion

Multimodal fusion combines information from multiple modalities into a unified representation, enabling AI models to reason jointly about different types of data.

Cross-modal Learning

Cross-modal learning trains models to transfer knowledge between modalities, such as using text supervision to improve visual representations or generating one modality from another.

Visual-Language Model

A visual-language model (VLM), more often called a vision-language model, is an AI model that jointly processes images and text, enabling tasks like visual question answering, captioning, and image-guided conversation.

Multimodal Embedding

Multimodal embeddings map data from different modalities (text, images, audio) into a shared vector space where semantically similar items are close together regardless of their modality.
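
As a rough illustration of "close together in a shared vector space", the sketch below compares a text embedding with two image embeddings using cosine similarity. The three-dimensional vectors are made-up values for illustration only; a real encoder would produce them, usually with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (assumed values, not output of any real model).
text_dog = [0.9, 0.1, 0.0]   # text: "a photo of a dog"
image_dog = [0.8, 0.2, 0.1]  # image showing a dog
image_car = [0.0, 0.2, 0.9]  # image showing a car

sim_match = cosine_similarity(text_dog, image_dog)
sim_mismatch = cosine_similarity(text_dog, image_car)
print(sim_match > sim_mismatch)
```

The matching text and image pair scores higher than the mismatched pair, which is the property cross-modal search relies on.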

Multimodal Reasoning

Multimodal reasoning is the ability of AI models to draw conclusions and make inferences by combining information from multiple modalities like text, images, and structured data.

Panoptic Segmentation

Panoptic segmentation unifies semantic and instance segmentation, assigning every pixel in an image a class label and, for countable objects, an instance identity.

Human Pose Estimation

Human pose estimation detects and tracks body joint positions to reconstruct skeletal configurations of people in images or video.

Face Verification

Face verification determines whether two face images belong to the same person, performing a one-to-one identity comparison.

Facial Landmark Detection

Facial landmark detection locates specific points on a face such as eyes, nose, mouth corners, and jawline to map facial geometry.

Facial Expression Recognition

Facial expression recognition classifies the emotional state displayed on a face, detecting expressions like happiness, sadness, anger, and surprise.

Age Estimation

Age estimation uses computer vision to predict the apparent age of a person from their facial image.

Face Generation

Face generation uses generative AI models to synthesize realistic human face images that depict people who do not exist.

YOLOv5

YOLOv5 is an object detection model from Ultralytics, implemented natively in PyTorch and widely used for its straightforward training and deployment workflow.

SSD (Single Shot Detector)

SSD is a single-shot object detection architecture that predicts bounding boxes and class scores from multiple feature map scales in a single forward pass.

RetinaNet

RetinaNet is a one-stage object detector that introduced focal loss to address class imbalance between foreground objects and background in dense detection.
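
To make the class-imbalance point concrete, here is a minimal sketch of binary focal loss for a single prediction, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name is illustrative, and the defaults alpha=0.25 and gamma=2.0 are commonly cited settings assumed here rather than taken from this glossary.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the foreground class, with 0 < p < 1
    y: ground-truth label, 1 for foreground or 0 for background
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_background = focal_loss(p=0.05, y=0)  # confident and correct
hard_foreground = focal_loss(p=0.05, y=1)  # confident and wrong
print(easy_background < hard_foreground)
```

Because easy background examples contribute almost nothing, the many background locations in a dense detector are less likely to drown out the few foreground objects.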

Grounding DINO

Grounding DINO is an open-set object detector that combines DINO detection with grounded pre-training, enabling detection of arbitrary objects described in text.

SAM 2

SAM 2 extends the Segment Anything Model to video, enabling real-time promptable segmentation and tracking of objects across video frames.

EfficientDet

EfficientDet is a family of scalable object detection models that use compound scaling and a bi-directional feature pyramid network for efficient multi-scale detection.

Anchor-Based Detection

Anchor-based detection uses predefined reference boxes (anchors) of various sizes and aspect ratios as starting points for predicting object locations.
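
One way to picture the predefined reference boxes is to generate them for a single location. The sketch below is a simplified illustration, not any particular detector's implementation; the `make_anchors` helper and the scale and ratio values are assumptions chosen for the example.

```python
import math

def make_anchors(cx, cy, scales, aspect_ratios):
    """Return (x_min, y_min, x_max, y_max) boxes centered at (cx, cy).

    Each scale is the square root of the box area.
    Each aspect ratio is width divided by height.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            w = scale * math.sqrt(ratio)  # width grows with the ratio
            h = scale / math.sqrt(ratio)  # height shrinks, so w * h == scale ** 2
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Two scales times three aspect ratios gives six anchors at this location.
anchors = make_anchors(cx=100, cy=100, scales=[32, 64], aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors))
```

A detector built this way would repeat the same set at every feature-map location and then predict offsets relative to each anchor.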

Anchor-Free Detection

Anchor-free detection predicts object locations directly without predefined reference boxes, using approaches like center-point prediction or corner detection.

Gemini Vision

Gemini Vision refers to the visual understanding capabilities of Google Gemini models, enabling multimodal reasoning across text, images, video, and audio.

Claude Vision

Claude Vision refers to the visual understanding capabilities of Anthropic Claude models, enabling image analysis, document comprehension, and visual reasoning.

Visual Grounding

Visual grounding locates specific regions in an image that correspond to a natural language expression, connecting text descriptions to visual content.

Visual Reasoning

Visual reasoning is the ability of AI models to draw logical conclusions from visual information, going beyond perception to higher-order understanding.

Scene Text Recognition

Scene text recognition detects and reads text appearing naturally in images, such as signs, labels, license plates, and street names.

Chart Understanding

Chart understanding enables AI models to interpret data visualizations like bar charts, line graphs, pie charts, and scatter plots, extracting data and insights.

Image-to-Image

Image-to-image translation transforms an input image into a corresponding output image, applying changes like style, content, or domain transfer.

DALL-E

DALL-E is a series of text-to-image generation models by OpenAI that create images from natural language descriptions with high fidelity and creativity.

Midjourney Model

Midjourney is an AI image generation model known for its exceptional aesthetic quality, artistic style, and photorealistic rendering capabilities.

AI Image Editing

AI image editing uses machine learning to intelligently modify images, enabling tasks like object removal, background replacement, and text-guided editing.

Image Colorization

Image colorization uses AI to automatically add realistic color to grayscale or black-and-white photographs and videos.

Video Classification

Video classification assigns category labels to video clips, analyzing temporal and spatial patterns to understand the overall content or activity shown.

Video Object Tracking

Video object tracking follows specific objects across video frames, maintaining their identity even through occlusion, appearance changes, and camera motion.

Video Captioning

Video captioning generates natural language descriptions of video content, summarizing actions, events, and scenes depicted across temporal sequences.

Optical Flow

Optical flow estimates the pattern of apparent motion between consecutive video frames, representing the pixel-level displacement of objects and the camera.

Video Diffusion Model

Video diffusion models extend image diffusion architectures to generate or edit video content by modeling temporal coherence across frames.

Point Cloud

A point cloud is a set of 3D data points in space, typically generated by LiDAR sensors or depth cameras, representing the surface geometry of objects and environments.

LiDAR

LiDAR (Light Detection and Ranging) uses laser pulses to measure distances and create precise 3D maps of environments with centimeter-level accuracy.
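
The distance measurement can be sketched with the time-of-flight relation, distance = speed of light × round-trip time / 2. This assumes a pulsed time-of-flight sensor (the entry above does not say which ranging method is meant), and the function name is illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in vacuum

def tof_distance_m(round_trip_seconds):
    """Distance to the target from a laser pulse's round-trip time.

    The pulse travels to the target and back, so the one-way
    distance is half of the total path length.
    """
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2

# A pulse that returns after 100 nanoseconds hit something about 15 m away.
distance = tof_distance_m(100e-9)
print(round(distance, 2))
```

At these speeds, one centimeter of range corresponds to roughly 67 picoseconds of round-trip time, which hints at the timing precision such sensors need.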

SLAM

SLAM (Simultaneous Localization and Mapping) enables a device to build a map of an unknown environment while simultaneously tracking its own location within it.

Scene Understanding

Scene understanding is the comprehensive perception of a visual scene, including recognizing objects, their relationships, spatial layout, and contextual meaning.

Photogrammetry

Photogrammetry reconstructs 3D models and measurements from overlapping photographs, using multiple camera viewpoints to triangulate 3D geometry.

Cross-Modal Retrieval

Cross-modal retrieval searches for content in one modality using a query from a different modality, such as finding images using text descriptions.

Multimodal RAG

Multimodal RAG extends retrieval-augmented generation to handle multiple data types, retrieving and reasoning over text, images, tables, and charts together.

Multimodal Agent

A multimodal agent is an AI agent that can perceive and interact with its environment through multiple sensory modalities including vision, language, and action.

Image Augmentation

Image augmentation applies transformations to training images to artificially expand dataset size and diversity, improving model generalization and robustness.
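
Two of the simplest such transformations can be sketched in plain Python on a tiny grayscale image stored as nested lists. The helper names and pixel values are illustrative assumptions; production pipelines typically use an image library instead.

```python
def horizontal_flip(image):
    """Mirror each row so the image is flipped left to right."""
    return [list(reversed(row)) for row in image]

def adjust_brightness(image, delta):
    """Add delta to every pixel, clamping results to the 0-255 range."""
    return [[min(255, max(0, px + delta)) for px in row] for row in image]

# A hypothetical 2x3 grayscale image with values in 0-255.
image = [
    [0, 50, 100],
    [150, 200, 250],
]

# One original sample becomes three training samples.
augmented = [image, horizontal_flip(image), adjust_brightness(image, 20)]
print(len(augmented))
```

Each variant keeps the same label as the original, so the model sees more varied inputs without any new annotation work.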

Feature Extraction

Feature extraction converts raw image pixels into meaningful numerical representations that capture visual patterns, structures, and semantics for downstream tasks.

Transfer Learning for Vision

Transfer learning for vision applies knowledge from models pretrained on large image datasets to new visual tasks, enabling strong performance with limited task-specific data.

Page 99 of 290. Showing 48 of 13,917 matching glossary pages.

Turn owned content into answers

Use InsertChat to launch a branded assistant visitors can ask directly.

7-day free trial · No card required

Product FAQ

What is InsertChat?

InsertChat is a white-label AI assistant for your website. Train it, brand it, publish it, and learn from visitor questions.

How does InsertChat use my website content?

Connect approved pages, docs, videos, FAQs, policies, and other sources. InsertChat turns them into source-backed answers and next steps.

Can I control the assistant's tone and sources?

Yes. Choose its sources, tone, welcome message, and prompts so it stays on brand.

How does InsertChat stay accurate?

Answers use approved content and source links. Analytics show unclear or missing answers so you can improve coverage.

Can it collect leads or route support questions?

Yes. InsertChat can collect details, qualify intent, add context, and send chats to the right inbox, CRM, workflow, or person.

Can I control how the assistant behaves?

Yes. Control prompts, model choice, tool access, and the branded assistant experience so behavior stays consistent.

Which AI models can I use?

InsertChat supports multiple model providers. Choose each assistant's model for quality, speed, and cost, or use BYOK.

Can I pick different models for different workflows?

Yes. Use a faster model for common questions and a stronger model for complex reasoning. InsertChat supports that balance per conversation.

Where can I deploy an assistant?

Use a widget, embed, full-page assistant, custom domain, in-app embed, or API. Reuse one setup across surfaces.

Do I need coding skills?

No. Build and deploy AI assistants using our visual builder. The embed code is one line of JavaScript.

Can I customize the branding and UI?

Yes. Customize the assistant name, logo, colors, welcome message, suggested prompts, tone, domain, and white-label presentation.

Can I use my own domain?

Yes. Custom domains are supported, typically via enterprise options.

Does InsertChat support voice?

Yes. Voice dictation and text-to-speech let users speak instead of type.

Does InsertChat support vision?

Yes. Enable vision for assistants when images help clarify a request or context.

What tools and integrations are supported?

Zendesk, HubSpot, Shopify, WooCommerce, calendar booking, web search, Perplexity, and webhooks for your own systems.

Can I control which tools the assistant is allowed to use?

Yes. Tool access is controlled per assistant so you enable only what you need.

Can the agent hand off to a human?

Yes. Configure human handoff so the agent escalates when needed. Full conversation history is passed along.

Do you provide analytics?

Yes. Track chats, leads, feedback, top questions, unanswered questions, most-used sources, and content gaps.

Is it mobile friendly?

Yes. The widget and embeds work well on desktop and mobile with no separate experience needed.

What's the fastest path to a successful deployment?

Start with one assistant and a small set of high-value sources. Iterate using real questions from analytics.

What is the fastest way to get started?

Create an account. Connect one key source. Ask a test question, brand the assistant, then publish it on one page.