Machine Learning At Your Service
by Hugging Face
Easily deploy Transformers, Diffusers, or any model on dedicated, fully managed infrastructure. Keep your costs low with our secure, compliant, and flexible production solution.
No Hugging Face account? Sign up!
One-click inference deployment
Import your favorite model from the Hugging Face Hub or browse our catalog of hand-picked, ready-to-deploy models! Prefer to script it? A minimal deployment sketch follows the catalog below.
Llama-3.1-70B-Instruct
A 70-billion-parameter model from Meta, optimized for dialogue. It generates helpful, safe responses and outperforms many open-source chat LLMs on common benchmarks.
gemma-2-27b-it
An instruction-tuned variant of Gemma 2, Google's open LLM, with 27 billion parameters.
Qwen2.5-Coder-7B-Instruct
An instruction-tuned 7B model from Qwen for code generation, code reasoning, and code fixing, with support for context lengths of up to 128K tokens.
FLUX.1-schnell
A 12-billion-parameter rectified flow transformer that delivers cutting-edge output quality and competitive prompt following.
mxbai-embed-large-v1
This model produces 1024-dimensional embeddings that perform extremely well on a variety of tasks. See the repository for the prefix required for queries.
whisper-large-v3-turbo
A fine-tuned version of a pruned Whisper large-v3, up to 8x faster than the original at the expense of a minor quality degradation.
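If you'd rather script a deployment than click through the UI, the huggingface_hub Python client exposes a create_inference_endpoint helper. The sketch below is illustrative only: the vendor, region, and instance values are placeholders, and the valid options depend on your account and the current hardware catalog.

```python
# Minimal sketch: deploying a catalog model programmatically with huggingface_hub.
# Hardware values (vendor, region, instance_type, instance_size) are placeholders;
# check the Inference Endpoints hardware catalog for the options available to you.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "whisper-turbo-demo",
    repository="openai/whisper-large-v3-turbo",
    framework="pytorch",
    task="automatic-speech-recognition",
    accelerator="gpu",
    vendor="aws",                # placeholder
    region="us-east-1",          # placeholder
    instance_type="nvidia-a10g", # placeholder, pick from the hardware catalog
    instance_size="x1",          # placeholder
    min_replica=0,               # scale to zero when idle to keep costs low
    max_replica=1,
)

endpoint.wait()   # block until the endpoint reports it is running
print(endpoint.url)
```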
Customer Stories
Learn how leading AI teams use Inference Endpoints to deploy their models
Endpoints for Music
Musixmatch is the world’s leading music data company
Custom text embeddings generation pipeline
- distilbert-base-uncased-finetuned-sst-2-english
- facebook/wav2vec2-base-960h
- Custom model based on sentence transformers
The coolest thing was how easy it was to define a complete custom interface from the model to the inference process. It just took us a couple of hours to adapt our code, and have a functioning and totally custom endpoint.
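A custom pipeline like Musixmatch's is typically packaged as a handler.py committed alongside the model: Inference Endpoints looks for an EndpointHandler class and routes each request payload to it. The sketch below is a minimal illustration assuming a sentence-transformers model; Musixmatch's actual handler is not public.

```python
# handler.py -- minimal sketch of a custom embedding handler, assuming the
# repository contains a sentence-transformers model. Illustrates only the
# EndpointHandler contract, not Musixmatch's actual pipeline.
from typing import Any, Dict, List

from sentence_transformers import SentenceTransformer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` points at the local copy of the model repository.
        self.model = SentenceTransformer(path)

    def __call__(self, data: Dict[str, Any]) -> List[List[float]]:
        # Inference Endpoints delivers the JSON request body as `data`,
        # with the user payload under the "inputs" key.
        sentences = data["inputs"]
        if isinstance(sentences, str):
            sentences = [sentences]
        embeddings = self.model.encode(sentences, normalize_embeddings=True)
        return embeddings.tolist()
```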
Endpoints for Health
Phamily improves patient health with intelligent care management
HIPAA-compliant secure endpoints for text classification
- Custom model based on text-classification (MPNET)
- Custom model based on text-classification (BERT)
It took off a week's worth of developer time. Thanks to Inference Endpoints, we now basically spend all of our time on R&D, not fiddling with AWS. If you haven't already built a robust, performant, fault tolerant system for inference, then it's pretty much a no brainer.
Endpoints for Search
Pinecone is the vector database for intelligent search
Autoscaling endpoints for fast embeddings generation
- Different sentence transformers and embedding models
We were able to choose an off the shelf model that's very common for our customers to get started with and set it so that it can be configured to handle over 100 requests per second just with a few button clicks. With the release of the Hugging Face Inference Endpoints, we believe there's a new standard for how easy it can be to go build your first vector embedding based solution, whether it be semantic search or question answering system.
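Once an embedding endpoint like the one Pinecone describes is running, it is a plain HTTPS service you can call from any client. A minimal sketch, with the endpoint URL and token as placeholders:

```python
# Minimal sketch: querying a deployed embedding endpoint over HTTPS.
# ENDPOINT_URL and HF_TOKEN are placeholders for your own endpoint and token.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": ["How do I deploy a model?", "What does autoscaling cost?"]},
    timeout=30,
)
response.raise_for_status()
embeddings = response.json()  # typically one vector per input sentence
print(len(embeddings), len(embeddings[0]))
```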
Endpoints for Videos
Waymark is an AI-powered video creator
Multi-modal endpoints for embeddings, audio and image generation
- sentence-transformers/all-mpnet-base-v2
- google/vit-base-patch16-224-in21k
- Custom model based on florentgbelidji/blip_captioning
You're bringing the potential time delta between "I've never seen anything that could do this before" and "I could have it on infrastructure ready to support an existing product" down to potentially less than a day.
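A multi-modal setup like Waymark's mixes text, image, and audio endpoints. As one hedged example, an image-captioning endpoint (for instance one built on a BLIP handler) can usually be called by posting raw image bytes with the right content type; the exact request and response formats depend on the deployed handler.

```python
# Minimal sketch: sending an image to a captioning endpoint. The URL and token
# are placeholders, and the payload convention depends on the deployed handler.
import requests

ENDPOINT_URL = "https://<your-captioning-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

with open("frame.jpg", "rb") as f:
    image_bytes = f.read()

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}", "Content-Type": "image/jpeg"},
    data=image_bytes,
    timeout=60,
)
response.raise_for_status()
print(response.json())  # e.g. a generated caption for the frame
```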
Pricing
Choose a plan that fits your needs
Self-Serve
Pay as you go when using Inference Endpoints
- Pay for what you use, per minute
- Starting as low as $0.06/hour
- Billed monthly
- Email support
Enterprise
Get a custom quote and premium support
- Lower marginal costs based on volume
- Uptime guarantees
- Custom annual contracts
- Dedicated support, SLAs