Glossary

AI glossary for content assistants

Plain-English definitions of 13,917 AI terms for branded assistant teams.

Plain English · RAG · LLMs

13,917 terms. Open one for definitions and related concepts.

Canary Deployment

Canary deployment routes a small percentage of traffic to a new model version and monitors it for issues before full rollout, reducing the risk of shipping a degraded model.
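
A minimal sketch of the routing and promotion logic described above, in plain Python. The function names, the error-rate comparison, the 1% tolerance, and the doubling step are illustrative assumptions, not taken from any particular serving system.

```python
import random


def pick_version(canary_fraction: float, rng: random.Random) -> str:
    """Send one request to 'canary' with probability canary_fraction, else 'stable'."""
    return "canary" if rng.random() < canary_fraction else "stable"


def next_canary_fraction(current, canary_error_rate, stable_error_rate,
                         tolerance=0.01, step=2.0):
    """Return the next traffic share for the canary (hypothetical policy).

    Rolls back to 0.0 when the canary's error rate exceeds the stable
    version's by more than `tolerance`; otherwise multiplies exposure by
    `step`, capped at 1.0 (full rollout).
    """
    if canary_error_rate > stable_error_rate + tolerance:
        return 0.0
    return min(1.0, current * step)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demo is repeatable
    routed = [pick_version(0.05, rng) for _ in range(10_000)]
    print(routed.count("canary") / len(routed))    # roughly 0.05
    print(next_canary_fraction(0.05, 0.02, 0.02))  # healthy canary: exposure grows
    print(next_canary_fraction(0.05, 0.10, 0.02))  # degraded canary: rolled back
```

In practice the health signal could be latency, quality scores, or business metrics rather than a single error rate.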

TensorFlow Serving

TensorFlow Serving is a production serving system designed for deploying TensorFlow models at scale with features like hot-swappable model versions and batching.

TorchServe

TorchServe is PyTorch's official serving solution that packages and serves PyTorch models with features like multi-model serving, logging, and metrics.

vLLM

vLLM is a high-throughput inference engine for large language models that uses PagedAttention to efficiently manage GPU memory and maximize serving throughput.

TGI

TGI (Text Generation Inference) is Hugging Face's production-grade inference server for large language models, optimized for high throughput with features like continuous batching and quantization.

GPTQ

GPTQ is a post-training quantization method for large language models that compresses model weights to lower precision (typically 4-bit), quantizing one layer at a time against a small calibration dataset and using approximate second-order information to limit the resulting error.

AWQ

AWQ (Activation-aware Weight Quantization) is a post-training quantization method for LLMs that uses activation statistics to identify the small fraction of weights that matter most and scales them to protect them during quantization, enabling efficient 4-bit compression.

SGLang

SGLang is a structured generation language and runtime for LLMs that enables efficient execution of complex prompting patterns like branching, forking, and constrained decoding.

Model Degradation

Model degradation is the gradual decline in ML model performance over time due to changes in data patterns, user behavior, or the environment the model operates in.

Latency Monitoring

Latency monitoring tracks the time taken by ML model inference requests, measuring end-to-end response times to verify that model serving meets its performance requirements.
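
As a rough illustration, latency is usually summarized with percentiles rather than averages, since a few slow requests can hide behind a healthy mean. The sketch below uses the nearest-rank percentile definition. The helper names and the p95 budget check are hypothetical and not tied to any monitoring product.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 0..100)."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def breaches_slo(latencies_ms, p95_budget_ms):
    """True when the observed p95 latency exceeds the budget (hypothetical check)."""
    return percentile(latencies_ms, 95) > p95_budget_ms


if __name__ == "__main__":
    latencies = []
    for _ in range(50):
        # sum() stands in for a real model inference call.
        _, ms = timed_call(sum, range(10_000))
        latencies.append(ms)
    print("p50 ms:", percentile(latencies, 50))
    print("p95 ms:", percentile(latencies, 95))
```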

AWS Bedrock

AWS Bedrock is Amazon's managed service for accessing foundation models from multiple providers through a single API, including models from Anthropic, Meta, Mistral, and Amazon.

Hugging Face Inference API

Hugging Face Inference API provides hosted, production-ready API endpoints for running models from the Hugging Face Hub without managing your own infrastructure.

Data Pipeline

A data pipeline is an automated workflow that extracts data from sources, transforms it, and loads it into destinations for analytics, ML training, or serving.

Experiment Management

Experiment management is the systematic organization, tracking, and comparison of machine learning experiments across parameters, datasets, and results.

Model Training Pipeline

A model training pipeline is an automated, reproducible workflow that takes raw data through preprocessing, feature engineering, model training, and evaluation.

Model Evaluation Pipeline

A model evaluation pipeline is an automated workflow that systematically assesses a trained model against defined metrics, benchmarks, and quality gates before deployment.

Model Selection

Model selection is the process of choosing the best model architecture, algorithm, and hyperparameters for a given task based on evaluation results and constraints.

Model Maintenance

Model maintenance encompasses the ongoing activities required to keep a deployed ML model performing well, including monitoring, retraining, updating, and patching.

Model Retirement

Model retirement is the planned process of decommissioning an ML model from production, including traffic migration, resource cleanup, and documentation archival.

Model Governance

Model governance is the framework of policies, processes, and controls that ensure ML models are developed, deployed, and maintained responsibly and in compliance with regulations.

Model Lifecycle

The model lifecycle encompasses all stages an ML model goes through, from initial problem definition and data collection to training, deployment, monitoring, and retirement.

Model Catalog

A model catalog is a searchable inventory of all ML models in an organization, providing metadata, documentation, and status information for discovery and governance.

Model Lineage

Model lineage tracks the complete provenance of an ML model, including the data, code, parameters, and environment used to create it.

CI/CD for ML

CI/CD for ML extends traditional continuous integration and continuous delivery practices to machine learning, automating testing, training, evaluation, and deployment of models.

Continuous Evaluation

Continuous evaluation is the practice of automatically and regularly assessing a deployed model against fresh data and updated benchmarks to detect performance changes.

Continuous Monitoring

Continuous monitoring is the practice of constantly observing ML system health, model performance, data quality, and resource usage in production environments.

TPU v4

TPU v4 is the fourth generation of Google's custom Tensor Processing Units, offering significant performance improvements over earlier generations for large-scale ML training and inference.

AWS Trainium

AWS Trainium is a custom ML chip designed by Amazon for high-performance, cost-effective deep learning training in the cloud.

AWS Inferentia

AWS Inferentia is a custom ML chip designed by Amazon for high-performance, cost-effective inference workloads in the cloud.

Intel Gaudi

Intel Gaudi is an AI accelerator designed for deep learning training and inference, positioned as a price-performance alternative to NVIDIA GPUs.

Multi-Node Training

Multi-node training distributes ML model training across multiple servers, each containing one or more GPUs, to handle models and datasets too large for a single machine.

Horovod

Horovod is an open-source distributed deep learning training framework that makes it easy to scale training across multiple GPUs and machines using data parallelism.

NCCL

NCCL (NVIDIA Collective Communications Library) provides optimized GPU-to-GPU communication primitives for distributed deep learning, including all-reduce, broadcast, and all-gather operations.

Gradient Synchronization

Gradient synchronization is the process of aggregating gradients across multiple GPUs during distributed training to ensure all model replicas update consistently.
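
To make the idea concrete, here is a toy, single-process simulation of data-parallel gradient averaging. Real systems perform this step with a collective such as all-reduce across devices. The lists standing in for workers, and the function names, are assumptions made for illustration only.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers.

    per_worker_grads: one list of floats per worker, all of equal length.
    Returns the averaged gradient that every worker would receive.
    """
    n_workers = len(per_worker_grads)
    return [sum(column) / n_workers for column in zip(*per_worker_grads)]


def sgd_step(weights, grads, lr):
    """Plain SGD update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]


if __name__ == "__main__":
    # Each worker computed a different gradient on its own data shard.
    grads = [[1.0, 2.0], [3.0, 4.0]]
    synced = all_reduce_mean(grads)

    # Every replica starts from the same weights and applies the same
    # averaged gradient, so all replicas stay identical after the step.
    replicas = [sgd_step([0.5, 0.5], synced, lr=0.1) for _ in grads]
    print(synced)
    print(replicas[0] == replicas[1])
```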

Model Serving Infrastructure

Model serving infrastructure is the complete stack of hardware, software, and networking required to host and serve ML model predictions to applications and users.

Model Endpoint

A model endpoint is a network-accessible URL or service that accepts input data and returns model predictions, serving as the interface between ML models and applications.

gRPC Endpoint

A gRPC endpoint serves ML model predictions over the gRPC protocol, which uses HTTP/2 and Protocol Buffers and typically offers lower latency and higher throughput than JSON-based REST for inter-service communication.

Streaming Inference

Streaming inference delivers model predictions incrementally as they are generated, rather than waiting for the complete result before responding.
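
One way to picture this is with a Python generator: the producer yields each chunk as soon as it exists, and the consumer handles chunks one by one instead of waiting for the full string. `fake_generate` is a hypothetical stand-in for a model and simply upper-cases the prompt word by word.

```python
def fake_generate(prompt):
    """Hypothetical model: yields one output chunk per input word."""
    for word in prompt.split():
        yield word.upper() + " "


def consume(stream, on_chunk):
    """Pass each chunk to on_chunk as it arrives, then return the full text."""
    parts = []
    for chunk in stream:
        on_chunk(chunk)  # e.g. push to the client over SSE or a WebSocket
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    full_text = consume(fake_generate("hello streaming world"),
                        on_chunk=lambda c: print(c, end="", flush=True))
    print()
    print(len(full_text))
```

The user sees the first chunk after one generation step rather than after the whole response, which is the main benefit for long outputs.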

API Gateway for ML

An API gateway for ML routes prediction requests to model endpoints, handling authentication, rate limiting, traffic management, and observability for ML APIs.

Load Balancer for ML

A load balancer for ML distributes prediction requests across multiple model serving replicas, optimizing for GPU utilization, latency, and availability.

Auto-Scaling for ML

Auto-scaling for ML automatically adjusts the number of model serving replicas based on demand, GPU utilization, or queue depth to balance cost and performance.
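
A common way to turn a metric into a replica count is a proportional rule: scale the current replica count by the ratio of the observed metric to its target, then clamp to configured bounds. The Kubernetes Horizontal Pod Autoscaler uses a rule of this form. The function below is a simplified sketch, and its name, parameters, and default bounds are assumptions.

```python
import math


def desired_replicas(current_replicas, observed_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    observed_metric and target_metric share a unit, for example average
    GPU utilization in percent or queued requests per replica.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * observed_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, observed_metric=90, target_metric=60))  # scale out
    print(desired_replicas(4, observed_metric=30, target_metric=60))  # scale in
```

Production autoscalers usually add stabilization windows and cooldowns on top of this rule so that replica counts do not oscillate.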

Blue-Green Deployment

Blue-green deployment is a release strategy that runs two identical production environments, allowing instant switching between the current (blue) and new (green) version of an ML model.

Model Warm-Up

Model warm-up is the process of loading an ML model into memory and running initial inference requests to optimize performance before serving production traffic.

Cold Start in ML

Cold start in ML refers to the delay when a model serving instance starts up, including loading model weights, initializing frameworks, and performing warm-up before serving predictions.

Ray Serve

Ray Serve is a scalable model serving framework built on Ray that supports complex inference graphs, dynamic batching, and seamless scaling across CPUs and GPUs.

MLflow Serving

MLflow Serving deploys models logged in MLflow as REST API endpoints, supporting multiple ML frameworks and providing a standardized serving interface.

Model Monitoring Infrastructure

Model monitoring infrastructure is the technical stack of tools and systems that collect, process, and alert on ML model performance, data quality, and operational metrics.

Feature Drift

Feature drift is the change in the statistical distribution of individual input features over time, potentially degrading model performance when production data diverges from training data.
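
One widely used way to quantify this for a single numeric feature is the Population Stability Index (PSI), which compares binned distributions of training and production values. The sketch below is a plain-Python version under stated assumptions: equal-width bins taken from the reference data, a small floor `eps` to avoid dividing by zero, and non-empty inputs. The function name is illustrative.

```python
import math


def psi(reference, production, n_bins=10, eps=1e-6):
    """Population Stability Index between two non-empty samples of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each share at eps so the log ratio is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, prod = fractions(reference), fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


if __name__ == "__main__":
    train = [i / 100 for i in range(1000)]  # values from 0.00 to 9.99
    shifted = [v + 5 for v in train]        # production values moved by +5
    print(psi(train, train))    # identical distributions give 0.0
    print(psi(train, shifted))  # a large shift gives a large PSI
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions, not guarantees.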

Turn owned content into answers

Use InsertChat to launch a branded assistant visitors can ask directly.

7-day free trial · No card required

Product FAQ

What is InsertChat?

InsertChat is a white-label AI assistant for your website. Train it, brand it, publish it, and learn from visitor questions.

How does InsertChat use my website content?

Connect approved pages, docs, videos, FAQs, policies, and other sources. InsertChat turns them into source-backed answers and next steps.

Can I control the assistant's tone and sources?

Yes. Choose its sources, tone, welcome message, and prompts so it stays on brand.

How does InsertChat stay accurate?

Answers use approved content and source links. Analytics show unclear or missing answers so you can improve coverage.

Can it collect leads or route support questions?

Yes. InsertChat can collect details, qualify intent, add context, and send chats to the right inbox, CRM, workflow, or person.

Can I control how the assistant behaves?

Yes. Control prompts, model choice, tool access, and the branded assistant experience so behavior stays consistent.

Which AI models can I use?

InsertChat supports multiple model providers. Choose each assistant's model for quality, speed, and cost, or use BYOK.

Can I pick different models for different workflows?

Yes. Use a faster model for common questions and a stronger model for complex reasoning. InsertChat supports that balance per conversation.

Where can I deploy an assistant?

Use a widget, embed, full-page assistant, custom domain, in-app embed, or API. Reuse one setup across surfaces.

Do I need coding skills?

No. Build and deploy AI assistants using our visual builder. The embed code is one line of JavaScript.

Can I customize the branding and UI?

Yes. Customize the assistant name, logo, colors, welcome message, suggested prompts, tone, domain, and white-label presentation.

Can I use my own domain?

Yes. Custom domains are supported, typically via enterprise options.

Does InsertChat support voice?

Yes. Voice dictation and text-to-speech let users speak instead of type.

Does InsertChat support vision?

Yes. Enable vision for assistants when images help clarify a request or context.

What tools and integrations are supported?

Zendesk, HubSpot, Shopify, WooCommerce, calendar booking, web search, Perplexity, and webhooks for your own systems.

Can I control which tools the assistant is allowed to use?

Yes. Tool access is controlled per assistant so you enable only what you need.

Can the assistant hand off to a human?

Yes. Configure human handoff so the assistant escalates when needed. Full conversation history is passed along.

Do you provide analytics?

Yes. Track chats, leads, feedback, top questions, unanswered questions, most-used sources, and content gaps.

Is it mobile friendly?

Yes. The widget and embeds work well on desktop and mobile with no separate experience needed.

What's the fastest path to a successful deployment?

Start with one assistant and a small set of high-value sources. Iterate using real questions from analytics.

What is the fastest way to get started?

Create an account. Connect one key source. Ask a test question, brand the assistant, then publish it on one page.