Install Llama Stack
This document describes how to install and deploy Llama Stack Server on Kubernetes using the Llama Stack Operator.
Upload Operator
Download the Llama Stack Operator installation file (e.g., `llama-stack-operator.alpha.ALL.xxxx.tgz`).
Use the `violet` command to publish the package to the platform repository:
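A sketch of the upload step, assuming the common `violet push` form; the package filename, platform address, and credentials are placeholders, so verify the exact flags with `violet push --help`:

```shell
# Hypothetical invocation; adjust the filename and platform address for your environment.
violet push llama-stack-operator.alpha.ALL.xxxx.tgz \
  --platform-address=<platform-access-address> \
  --platform-username=<admin-username> \
  --platform-password=<admin-password>
```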
Install Operator
- Go to the **Administrator** view in the Alauda Container Platform.
- In the left navigation, select **Marketplace / Operator Hub**.
- In the right panel, find **Alauda build of Llama Stack** and click **Install**.
- Keep all parameters as default and complete the installation.
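Optionally, confirm from the command line that the operator is running; the namespace and deployment name below are assumptions and may differ in your cluster:

```shell
# Hypothetical check: list operator workloads and look for a running llama-stack controller.
kubectl get deployments --all-namespaces | grep -i llama-stack
```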
Deploy Llama Stack Server
After the operator is installed, deploy Llama Stack Server by creating a `LlamaStackDistribution` custom resource:
Note: Prepare the following in advance; otherwise the distribution may not become ready:

- Inference URL: `VLLM_URL` must point at a vLLM OpenAI-compatible HTTP base URL (for example, an in-cluster vLLM or KServe InferenceService) that serves the target model.
- Secret (optional): `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below).
- Storage Class: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready.
- PGVector (optional): To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.
- Embedding model download: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure `HF_ENDPOINT`. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request.
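A minimal manifest sketch, assuming the upstream llama-stack-k8s-operator CRD (`llamastack.io/v1alpha1`); the resource name, distribution name, model ID, endpoint URL, and storage size are placeholders to adapt:

```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack-distribution
spec:
  replicas: 1
  server:
    distribution:
      name: remote-vllm                # pick a distribution shipped with your operator build
    containerSpec:
      port: 8321
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-3.1-8B-Instruct"                # placeholder model ID
        - name: VLLM_URL
          value: "http://vllm.inference.svc.cluster.local:8000/v1" # placeholder endpoint
        # Only when the vLLM endpoint requires authentication, reference a Secret
        # created in the same namespace:
        # - name: VLLM_API_TOKEN
        #   valueFrom:
        #     secretKeyRef:
        #       name: vllm-api-token
        #       key: token
    storage:
      size: 10Gi                       # bound via the cluster's default Storage Class
```

Apply it with `kubectl apply -f <file>.yaml` and wait for the resource to become ready.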
After deployment, the Llama Stack Server will be available within the cluster. The access URL is displayed in `status.serviceURL`, for example:
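For instance, assuming the resource name used in the manifest above (the printed Service name and namespace are illustrative):

```shell
kubectl get llamastackdistribution llamastack-distribution \
  -o jsonpath='{.status.serviceURL}'
# e.g. http://llamastack-distribution-service.<namespace>.svc.cluster.local:8321
```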
Tool calling with vLLM on KServe
The following applies to the vLLM predictor on KServe, not to the `LlamaStackDistribution` manifest. For agent flows that use tools (client-side tools or MCP), the vLLM process must expose tool-call support. Add predictor container args as required by upstream vLLM, for example:
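A hedged `InferenceService` fragment showing the upstream vLLM tool-calling flags; the exact predictor layout depends on your serving runtime, and the parser value is only an example:

```yaml
spec:
  predictor:
    model:
      args:
        - --enable-auto-tool-choice   # let vLLM emit tool calls automatically
        - --tool-call-parser
        - hermes                      # example parser; depends on the model family
```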
Choose `--tool-call-parser` (and any related flags) according to the served model and the vLLM documentation for that model family.
Enable PGVector Vector Store
When `ENABLE_PGVECTOR=true` is set on the server, Llama Stack can create vector stores by using `provider_id="pgvector"` from the client API.
Recommended preparation:
- Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password.
- Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD` (see the snippet after this list).
- Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use.
- If the cluster uses a Hugging Face mirror or proxy, set `HF_ENDPOINT` accordingly.
- If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables.
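A sketch of the corresponding `containerSpec.env` fragment in the `LlamaStackDistribution`; the service host, database name, and Secret name are placeholders for your ACP PostgreSQL instance:

```yaml
- name: ENABLE_PGVECTOR
  value: "true"
- name: PGVECTOR_HOST
  value: "postgresql.database.svc.cluster.local"   # placeholder service name
- name: PGVECTOR_PORT
  value: "5432"
- name: PGVECTOR_DB
  value: "llamastack"                              # placeholder database name
- name: PGVECTOR_USER
  valueFrom:
    secretKeyRef:
      name: pgvector-credentials                   # hypothetical Secret
      key: username
- name: PGVECTOR_PASSWORD
  valueFrom:
    secretKeyRef:
      name: pgvector-credentials
      key: password
```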
After the distribution is ready, you can validate the setup with the PGVector section in the Quickstart notebook.
Hugging Face Access For Embedding Models
Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache.
Recommended cache path: `/home/lls/.lls/huggingface/hub`
Common deployment modes:

- Mirror or proxy access: point the server at the mirror by setting `HF_ENDPOINT` (see the first snippet below).
- Fully offline access: pre-download the required model files into the PVC-backed cache directory `/home/lls/.lls/huggingface/hub`, then set the offline environment variables (see the second snippet below).
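For mirror or proxy access, a sketch of the relevant `containerSpec.env` entry (the mirror URL is a placeholder):

```yaml
- name: HF_ENDPOINT
  value: "https://hf-mirror.example.com"   # placeholder mirror/proxy URL
```

For fully offline access, once the cache is pre-populated, the standard Hugging Face offline switches can be enabled:

```yaml
- name: HF_HUB_OFFLINE
  value: "1"
- name: TRANSFORMERS_OFFLINE
  value: "1"
```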
If the cache path is pre-populated correctly, the server can create PGVector-backed vector stores without downloading model artifacts at runtime.