Deploy backend with LLM in HuggingFace


Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

In this post, we will see how to deploy a backend on HuggingFace. We will cover two methods: the common way by creating an application with Gradio, and a different option using FastAPI, Langchain, and Docker.

For both cases, it will be necessary to have an account on HuggingFace, as we are going to deploy the backend in a HuggingFace space.

Deploy backend with Gradio

Create space

First of all, we create a new space on Hugging Face.

  • We give it a name and a description, and choose a license.
  • We choose Gradio as the SDK. When choosing Gradio, templates will appear, so we select the chatbot template.
  • We select the hardware on which we are going to deploy the backend. I will choose the free CPU, but you can choose whatever you consider best.
  • And finally, we need to choose whether we want to create the space as public or private.

backend gradio - create space

Code

When creating the space, we can clone it or we can view the files on the Hugging Face page itself. We can see that 3 files have been created: app.py, requirements.txt, and README.md. So let's take a look at what to put in each one.

app.py

Here we have the code for the application. Since we chose the chatbot template, we already have a lot done, but we will need to change two things: the language model and the system prompt.

The template uses HuggingFaceH4/zephyr-7b-beta as the language model, but we are going to use Qwen/Qwen2.5-72B-Instruct, which is a very capable model.

So, look for the text client = InferenceClient("HuggingFaceH4/zephyr-7b-beta") and replace it with client = InferenceClient("Qwen/Qwen2.5-72B-Instruct"), or wait until I put all the code later.

We will also change the system prompt, which by default is You are a friendly Chatbot., but since the model is trained mostly in English, it is likely that if you speak to it in another language it will respond in English, so we will change it to You are a friendly Chatbot. Always reply in the language in which the user is writing to you..

So, look for the text gr.Textbox(value="You are a friendly Chatbot.", label="System message"), and replace it with gr.Textbox(value="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.", label="System message"), or wait, as I am going to put all the code below.

import gradio as gr
from huggingface_hub import InferenceClient

"""
For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
"""
client = InferenceClient("Qwen/Qwen2.5-72B-Instruct")


def respond(
    message,
    history: list[tuple[str, str]],
    system_message,
    max_tokens,
    temperature,
    top_p,
):
    messages = [{"role": "system", "content": system_message}]

    for val in history:
        if val[0]:
            messages.append({"role": "user", "content": val[0]})
        if val[1]:
            messages.append({"role": "assistant", "content": val[1]})

    messages.append({"role": "user", "content": message})

    response = ""

    for message in client.chat_completion(
        messages,
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        token = message.choices[0].delta.content

        response += token
        yield response


"""
For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/gradio/chatinterface
"""
demo = gr.ChatInterface(
    respond,
    additional_inputs=[
        gr.Textbox(value="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.", label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.95,
            step=0.05,
            label="Top-p (nucleus sampling)",
        ),
    ],
)


if __name__ == "__main__":
    demo.launch()

requirements.txt

This is the file where the dependencies will be written, but for this case it's going to be very simple:

huggingface_hub==0.25.2

README.md

This is the file where we will put the information about the space. In HuggingFace Spaces, a YAML header is placed at the beginning of the README so that HuggingFace knows how to display the space's thumbnail, which file to run, the SDK version, etc.

---
title: SmolLM2
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.0.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Gradio SmolLM2 chat
---

An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).

Deployment

If we have cloned the space, we need to make a commit and a push. If we have modified the files in HuggingFace, saving them is enough.

Once the changes are in HuggingFace, we will have to wait a few seconds for the space to build, and then we can use it.

backend gradio - chatbot

Backend

Alright, we've created a chatbot, but that wasn't the intention; we came here to build a backend! Stop, stop, look at what appears below the chatbot.

backend gradio - Use via API

We can see the text Use via API; if we click it, a panel opens showing how to use the chatbot through its API.

backend gradio - API

We see that it provides documentation on how to use the API with Python, JavaScript, and bash.

API Test

We use the example Python code.

	
from gradio_client import Client

client = Client("Maximofn/SmolLM2")
result = client.predict(
    message="Hola, ¿cómo estás? Me llamo Máximo",
    system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
    api_name="/chat"
)
print(result)

Loaded as API: https://maximofn-smollm2.hf.space ✔
¡Hola Máximo! Mucho gusto, estoy bien, gracias por preguntar. ¿Cómo estás tú? ¿En qué puedo ayudarte hoy?

We are making calls to the InferenceClient API from HuggingFace, so we might wonder: why did we create a backend if we can call the HuggingFace API directly? You will see the answer in the following section.
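For reference, calling the Inference API directly looks roughly like this. This is a minimal sketch; the token value and the exact parameters are assumptions, adjust them to your setup.

from huggingface_hub import InferenceClient

# Direct call to the HuggingFace Inference API, without our Gradio backend in between
client = InferenceClient("Qwen/Qwen2.5-72B-Instruct", token="hf_...")  # hypothetical token
output = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly Chatbot."},
        {"role": "user", "content": "Hola, ¿cómo estás?"},
    ],
    max_tokens=128,
)
print(output.choices[0].message.content)

With the Gradio backend, on the other hand, the client keeps the conversation state for us, as we check next.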

	
result = client.predict(
    message="¿Cómo me llamo?",
    system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
    api_name="/chat"
)
print(result)

Tu nombre es Máximo. ¿Es correcto?

The Gradio chat template handles the history for us, so that each time we create a new client, a new conversation thread is created.

Let's try to create a new client and see if a new conversation thread is created.

	
from gradio_client import Client

new_client = Client("Maximofn/SmolLM2")
result = new_client.predict(
    message="Hola, ¿cómo estás? Me llamo Luis",
    system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
    api_name="/chat"
)
print(result)

Loaded as API: https://maximofn-smollm2.hf.space ✔
Hola Luis, estoy muy bien, gracias por preguntar. ¿Cómo estás tú? Es un gusto conocerte. ¿En qué puedo ayudarte hoy?

Now we ask it again what my name is.

	
result = new_client.predict(
    message="¿Cómo me llamo?",
    system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
    api_name="/chat"
)
print(result)

Te llamas Luis. ¿Hay algo más en lo que pueda ayudarte?

As we can see, we have two clients, each with their own conversation thread.

Deploy backend with FastAPI, Langchain and Docker

Now we are going to do the same, create a chatbot backend, with the same model, but in this case using FastAPI, Langchain and Docker.

Create space

We need to create a new space, but in this case we will configure it differently:

  • We give it a name and a description, and choose a license.
  • We choose Docker as the SDK. When choosing Docker, templates will appear, so we select the blank template.
  • We select the hardware on which we will deploy the backend. I will choose the free CPU, but you can choose whatever you consider best.
  • And lastly, we need to choose whether we want to create the space as public or private.

backend docker - create space

Code

Now, when creating the space, we see that we only have one file, the README.md. So we are going to have to create all the code ourselves.

app.py

Let's create the application code

Let's start with the necessary libraries

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from huggingface_hub import InferenceClient

from langchain_core.messages import HumanMessage, AIMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph

import os
from dotenv import load_dotenv
load_dotenv()

We import fastapi to create the API routes, pydantic to define the request models, huggingface_hub to create the language model client, langchain to indicate whether messages come from the user or the assistant, and langgraph to build the chatbot graph.

We also load os and dotenv to be able to load the environment variables.

We load the HuggingFace token

# HuggingFace token
HUGGINGFACE_TOKEN = os.environ.get("HUGGINGFACE_TOKEN", os.getenv("HUGGINGFACE_TOKEN"))

We create the language model

# Initialize the HuggingFace model
model = InferenceClient(
    model="Qwen/Qwen2.5-72B-Instruct",
    api_key=os.getenv("HUGGINGFACE_TOKEN")
)

We now create a function to call the model

# Define the function that calls the model
def call_model(state: MessagesState):
    """
    Call the model with the given messages

    Args:
        state: MessagesState

    Returns:
        dict: A dictionary containing the generated text and the thread ID
    """
    # Convert LangChain messages to HuggingFace format
    hf_messages = []
    for msg in state["messages"]:
        if isinstance(msg, HumanMessage):
            hf_messages.append({"role": "user", "content": msg.content})
        elif isinstance(msg, AIMessage):
            hf_messages.append({"role": "assistant", "content": msg.content})

    # Call the API
    response = model.chat_completion(
        messages=hf_messages,
        temperature=0.5,
        max_tokens=64,
        top_p=0.7
    )

    # Convert the response to LangChain format
    ai_message = AIMessage(content=response.choices[0].message.content)

    return {"messages": state["messages"] + [ai_message]}


We convert the messages from LangChain format to HuggingFace format, so we can use the language model.

We define a template for the queries

class QueryRequest(BaseModel):
    query: str
    thread_id: str = "default"

Each request will have a query, which is the user's message, and a thread_id, which identifies the conversation thread; we will explain later what we use it for.
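As a quick illustration (a hedged sketch, not part of the app), this is how the model validates a request body and fills in the default thread_id when it is not provided:

from pydantic import BaseModel

class QueryRequest(BaseModel):
    query: str
    thread_id: str = "default"

# Parse a request body; thread_id falls back to "default" when it is missing
req = QueryRequest(**{"query": "Hola, ¿cómo estás?"})
print(req.model_dump())  # {'query': 'Hola, ¿cómo estás?', 'thread_id': 'default'}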

We create a LangGraph graph

# Define the graph
workflow = StateGraph(state_schema=MessagesState)

# Define the node in the graph
workflow.add_edge(START, "model")
workflow.add_node("model", call_model)

# Add memory
memory = MemorySaver()
graph_app = workflow.compile(checkpointer=memory)

With this, we create a LangGraph graph, which is a data structure that allows us to create a chatbot and manages the chatbot's state for us, including, among other things, the message history. This way, we don't have to do it ourselves.
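To see what the checkpointer gives us, here is a small hedged sketch, assuming the model and graph defined above: two invocations with the same thread_id share the message history, while a different thread_id would start from scratch.

# Hypothetical quick test of the memory, using the graph_app compiled above
config = {"configurable": {"thread_id": "demo"}}

out1 = graph_app.invoke({"messages": [HumanMessage(content="Hola, me llamo Máximo")]}, config)
out2 = graph_app.invoke({"messages": [HumanMessage(content="¿Cómo me llamo?")]}, config)

# The second call sees the first exchange, because both share the "demo" thread
print(out2["messages"][-1].content)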

We create the FastAPI application

app = FastAPI(title="LangChain FastAPI", description="API to generate text using LangChain and LangGraph")

We create the API endpoints

# Welcome endpoint
@app.get("/")
async def api_home():
    """Welcome endpoint"""
    return {"detail": "Welcome to FastAPI, Langchain, Docker tutorial"}

# Generate endpoint
@app.post("/generate")
async def generate(request: QueryRequest):
    """
    Endpoint to generate text using the language model

    Args:
        request: QueryRequest
            query: str
            thread_id: str = "default"

    Returns:
        dict: A dictionary containing the generated text and the thread ID
    """
    try:
        # Configure the thread ID
        config = {"configurable": {"thread_id": request.thread_id}}

        # Create the input message
        input_messages = [HumanMessage(content=request.query)]

        # Invoke the graph
        output = graph_app.invoke({"messages": input_messages}, config)

        # Get the model response
        response = output["messages"][-1].content

        return {
            "generated_text": response,
            "thread_id": request.thread_id
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error generating text: {str(e)}")


We have created the endpoint `/`, which returns a text when we access the API, and the endpoint `/generate`, which we will use to generate text.

If we look at the `generate` function, we have the variable `config`, which is a dictionary containing the `thread_id`. This `thread_id` allows us to maintain a message history per user, so different users can use the same endpoint and each have their own message history.
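Once the application is running (for example locally with uvicorn app:app --host 0.0.0.0 --port 7860), something like this hedged sketch shows the per-thread behavior; the URL and the queries are assumptions:

import requests

# Hypothetical local URL; on HuggingFace it would be the space URL instead
url = "http://localhost:7860/generate"

# Two different thread_ids keep two independent histories
r1 = requests.post(url, json={"query": "Hola, me llamo Máximo", "thread_id": "user1"})
r2 = requests.post(url, json={"query": "Hola, me llamo Luis", "thread_id": "user2"})
r3 = requests.post(url, json={"query": "¿Cómo me llamo?", "thread_id": "user1"})

print(r3.json()["generated_text"])  # should refer to Máximo, not Luis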

Finally, we have the code for the application to run.

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7860)
      

Let's write all the code together

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from huggingface_hub import InferenceClient

from langchain_core.messages import HumanMessage, AIMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph

import os
from dotenv import load_dotenv
load_dotenv()

# HuggingFace token
HUGGINGFACE_TOKEN = os.environ.get("HUGGINGFACE_TOKEN", os.getenv("HUGGINGFACE_TOKEN"))

# Initialize the HuggingFace model
model = InferenceClient(
    model="Qwen/Qwen2.5-72B-Instruct",
    api_key=os.getenv("HUGGINGFACE_TOKEN")
)

# Define the function that calls the model
def call_model(state: MessagesState):
    """
    Call the model with the given messages

    Args:
        state: MessagesState

    Returns:
        dict: A dictionary containing the generated text and the thread ID
    """
    # Convert LangChain messages to HuggingFace format
    hf_messages = []
    for msg in state["messages"]:
        if isinstance(msg, HumanMessage):
            hf_messages.append({"role": "user", "content": msg.content})
        elif isinstance(msg, AIMessage):
            hf_messages.append({"role": "assistant", "content": msg.content})

    # Call the API
    response = model.chat_completion(
        messages=hf_messages,
        temperature=0.5,
        max_tokens=64,
        top_p=0.7
    )

    # Convert the response to LangChain format
    ai_message = AIMessage(content=response.choices[0].message.content)

    return {"messages": state["messages"] + [ai_message]}

# Define the graph
workflow = StateGraph(state_schema=MessagesState)

# Define the node in the graph
workflow.add_edge(START, "model")
workflow.add_node("model", call_model)

# Add memory
memory = MemorySaver()
graph_app = workflow.compile(checkpointer=memory)

# Define the data model for the request
class QueryRequest(BaseModel):
    query: str
    thread_id: str = "default"

# Create the FastAPI application
app = FastAPI(title="LangChain FastAPI", description="API to generate text using LangChain and LangGraph")

# Welcome endpoint
@app.get("/")
async def api_home():
    """Welcome endpoint"""
    return {"detail": "Welcome to FastAPI, Langchain, Docker tutorial"}

# Generate endpoint
@app.post("/generate")
async def generate(request: QueryRequest):
    """
    Endpoint to generate text using the language model

    Args:
        request: QueryRequest
            query: str
            thread_id: str = "default"

    Returns:
        dict: A dictionary containing the generated text and the thread ID
    """
    try:
        # Configure the thread ID
        config = {"configurable": {"thread_id": request.thread_id}}

        # Create the input message
        input_messages = [HumanMessage(content=request.query)]

        # Invoke the graph
        output = graph_app.invoke({"messages": input_messages}, config)

        # Get the model response
        response = output["messages"][-1].content

        return {
            "generated_text": response,
            "thread_id": request.thread_id
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error generating text: {str(e)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7860)

Dockerfile

Now let's see how to create the Dockerfile.

First we indicate which image we are going to start from

FROM python:3.13-slim
      

Now we create a non-root user and set the working directory

RUN useradd -m -u 1000 user
WORKDIR /app


We copy the file with the dependencies and install them

COPY --chown=user ./requirements.txt requirements.txt
RUN pip install --no-cache-dir --upgrade -r requirements.txt


We copy the rest of the code

COPY --chown=user . /app
      

We expose port 7860

EXPOSE 7860
      

We mount the HuggingFace token as a build secret and check that it exists

RUN --mount=type=secret,id=HUGGINGFACE_TOKEN,mode=0444,required=true \
    test -f /run/secrets/HUGGINGFACE_TOKEN && echo "Secret exists!"


Lastly, we indicate the command to run the application

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
      

Now we put it all together

FROM python:3.13-slim

RUN useradd -m -u 1000 user
WORKDIR /app

COPY --chown=user ./requirements.txt requirements.txt
RUN pip install --no-cache-dir --upgrade -r requirements.txt

COPY --chown=user . /app

EXPOSE 7860

RUN --mount=type=secret,id=HUGGINGFACE_TOKEN,mode=0444,required=true \
    test -f /run/secrets/HUGGINGFACE_TOKEN && echo "Secret exists!"

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]


requirements.txt

We create the file with the dependencies

fastapi
uvicorn
requests
pydantic>=2.0.0
langchain
langchain-huggingface
langchain-core
langgraph>0.2.27
python-dotenv

README.md

Finally, we create the README.md file with information about the space and instructions for HuggingFace.

---
title: SmolLM2 Backend
emoji: 📊
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
license: apache-2.0
short_description: Backend of SmolLM2 chat
app_port: 7860
---

# SmolLM2 Backend

This project implements a FastAPI API that uses LangChain and LangGraph to generate text with the Qwen2.5-72B-Instruct model from HuggingFace.

## Configuration

### In HuggingFace Spaces

This project is designed to run on HuggingFace Spaces. To configure it:

1. Create a new Space in HuggingFace with SDK Docker
2. Configure the `HUGGINGFACE_TOKEN` or `HF_TOKEN` environment variable in the Space configuration:
   - Go to the "Settings" tab of your Space
   - Scroll down to the "Repository secrets" section
   - Add a new variable with the name `HUGGINGFACE_TOKEN` and your token as the value
   - Save the changes

### Local Development

For local development:

1. Clone this repository
2. Create a `.env` file in the project root with your HuggingFace token:

```
HUGGINGFACE_TOKEN=your_token_here
```

3. Install the dependencies:

```bash
pip install -r requirements.txt
```

## Local execution

```bash
uvicorn app:app --reload
```

The API will be available at http://localhost:8000.

## Endpoints

### GET /

Welcome endpoint that returns a greeting message.

### POST /generate

Endpoint to generate text using the language model.

**Request parameters:**

```json
{
  "query": "Your question here",
  "thread_id": "optional_thread_identifier"
}
```

**Response:**

```json
{
  "generated_text": "Generated text by the model",
  "thread_id": "thread identifier"
}
```

## Docker

To run the application in a Docker container:

```bash
# Build the image
docker build -t smollm2-backend .

# Run the container
docker run -p 8000:8000 --env-file .env smollm2-backend
```

## API documentation

The interactive API documentation is available at:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

HuggingFace Token

As you may have noticed, both the code and the Dockerfile use a HuggingFace token, so we will have to create one. In our HuggingFace account, we create a new token, give it a name, and grant it the following permissions (after the screenshot below there is a short sketch to verify that the token works):

  • Read access to contents of all repos under your personal namespace
  • Make calls to inference providers
  • Make calls to Inference Endpoints

backend docker - token
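To quickly check that the token is valid before adding it to the space, here is a hedged sketch using huggingface_hub; the token value is, of course, an assumption:

from huggingface_hub import whoami

# Should print information about your account if the token is valid
info = whoami(token="hf_...")  # hypothetical token value
print(info.get("name"))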

Add the token to the space secrets

Now that we have the token, we need to add it to the space. At the top of the space, we will see a button called Settings; if we press it, we can see the space configuration section.

If we scroll down, we can see a section where we can add Variables and Secrets. In this case, since we are adding a token, we will add it to the Secrets.

We set the name to HUGGINGFACE_TOKEN and paste the token as the value.

Deployment

If we have cloned the space, we need to make a commit and a push. If we have modified the files in HuggingFace, saving them is enough.

Once the changes are in HuggingFace, we will have to wait a few seconds for the space to be built, and then we can use it.

In this case, we have only built a backend, so what we will see when entering the space is what we defined in the / endpoint.

backend docker - space

Backend URL

We need to know the URL of the backend to be able to make API calls. To do this, we have to click on the three dots in the top right corner to see the options.

backend docker - options

In the drop-down menu, we click on Embed this Space, which will open a window indicating how to embed the space with an iframe and also providing the URL of the space.

backend docker - embed

If we now go to that URL, we will see the same as in the space.
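We can also check it from Python. A hedged sketch follows; the space URL below is an assumption based on this example, so replace it with the one shown in "Embed this Space":

import requests

# Hypothetical space URL; replace it with your own
url = "https://maximofn-smollm2-backend.hf.space/"

response = requests.get(url)
print(response.json())  # {'detail': 'Welcome to FastAPI, Langchain, Docker tutorial'}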

Documentation

FastAPI, besides being an extremely fast API, has another great advantage: it generates documentation automatically.

If we add /docs to the URL we saw earlier, we will be able to see the API documentation with Swagger UI.

backend docker - swagger doc

We can also add /redoc to the URL to view the documentation with ReDoc.

backend docker - redoc doc
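Both Swagger UI and ReDoc are generated from the same OpenAPI schema, which FastAPI also serves as JSON. A hedged way to fetch it (again, the space URL is an assumption):

import requests

# FastAPI serves the OpenAPI schema that feeds /docs and /redoc
base_url = "https://maximofn-smollm2-backend.hf.space"  # hypothetical space URL
schema = requests.get(f"{base_url}/openapi.json").json()
print(list(schema["paths"].keys()))  # expected: ['/', '/generate']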

API Test

The good thing about Swagger UI documentation is that it allows us to test the API directly from the browser.

We add /docs to the URL we obtained, open the dropdown for the /generate endpoint, and click on Try it out. We modify the value of the query and the thread_id, and then press Execute.

In the first case I will use:

  • query: Hello, how are you? I'm Maximo
  • thread_id: user1

backend docker - test API

We received the following response: Hello Maximo! I'm doing very well, thank you for asking. How are you? What can I help you with today?

backend docker -response 1 - user1

Let's now do the same, but introducing ourselves as Luis and with a different thread_id, in this case user2.

backend docker - query 1 - user2

And it responds: Hello Luis! I'm doing very well, thank you for asking. How are you? What can I help you with today?

backend docker - response 1 - user2

Now we ask for our name in both threads and get this:

  • For the user user1: Your name is Máximo. Is there anything else I can help you with?
  • For the user user2: You are called Luis. Is there anything else I can help you with today, Luis?

backend docker - response 2 - user1

backend docker - response 2 - user2

Deploy backend with Gradio and model running on the server

The two backends we have created are actually not running a model, but rather making calls to HuggingFace Inference Endpoints. However, you might want everything to run on the server, including the model. It could be that you have fine-tuned an LLM for your use case, so you can no longer make calls to Inference Endpoints.

So let's see how to modify the code of the two backends to run a model on the server and not make calls to Inference Endpoints.

Create Space

When creating the space on HuggingFace, we do the same as before: create a new space, give it a name and a description, select Gradio as the SDK, choose the hardware on which we will deploy it (I select the most basic, free hardware), and choose whether to make it private or public.

Code

We need to make changes in app.py and requirements.txt so that instead of making calls to Inference Endpoints, the model runs locally.

app.py

The changes we have to make are:

Import AutoModelForCausalLM and AutoTokenizer from the transformers library and import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


Instead of creating a model using InferenceClient, we create it with AutoModelForCausalLM and AutoTokenizer.

# Load the model and the tokenizer
model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)


I use HuggingFaceTB/SmolLM2-1.7B-Instruct because it is a fairly capable model with only 1.7B parameters. Since I chose the most basic hardware, I can't use very large models. If you want to use a larger model, you have two options: use the free hardware and accept that inference will be slower, or use more powerful hardware, but at a cost.
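If you are unsure whether a given model fits your hardware, a hedged way to check its memory footprint after loading it (using the get_memory_footprint method from transformers; the exact number depends on the dtype):

from transformers import AutoModelForCausalLM
import torch

# Load the model as above and inspect how much memory its weights take
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    torch_dtype=torch.float16,
)
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")  # roughly 3.4 GB in float16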

Modify the respond function to build the prompt with the necessary structure for the transformers library, tokenize the prompt, perform inference, and detokenize the response.

def respond(
    message,
    history: list[tuple[str, str]],
    system_message,
    max_tokens,
    temperature,
    top_p,
):
    # Build the prompt with the correct format
    prompt = f"<|system|>\n{system_message}</s>\n"

    for val in history:
        if val[0]:
            prompt += f"<|user|>\n{val[0]}</s>\n"
        if val[1]:
            prompt += f"<|assistant|>\n{val[1]}</s>\n"

    prompt += f"<|user|>\n{message}</s>\n<|assistant|>\n"

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate the response
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response part
    response = response.split("<|assistant|>\n")[-1].strip()

    yield response


Here is all the code:

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

"""
For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
"""

# Load the model and the tokenizer
model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

def respond(
    message,
    history: list[tuple[str, str]],
    system_message,
    max_tokens,
    temperature,
    top_p,
):
    # Build the prompt with the correct format
    prompt = f"<|system|>\n{system_message}</s>\n"

    for val in history:
        if val[0]:
            prompt += f"<|user|>\n{val[0]}</s>\n"
        if val[1]:
            prompt += f"<|assistant|>\n{val[1]}</s>\n"

    prompt += f"<|user|>\n{message}</s>\n<|assistant|>\n"

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate the response
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response part
    response = response.split("<|assistant|>\n")[-1].strip()

    yield response


"""
For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/gradio/chatinterface
"""
demo = gr.ChatInterface(
    respond,
    additional_inputs=[
        gr.Textbox(
            value="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
            label="System message"
        ),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.95,
            step=0.05,
            label="Top-p (nucleus sampling)",
        ),
    ],
)


if __name__ == "__main__":
    demo.launch()


requirements.txt

In this file, we need to add the new libraries we are going to use, in this case transformers, accelerate and torch. The entire file would be:

huggingface_hub==0.25.2
gradio>=4.0.0
transformers>=4.36.0
torch>=2.0.0
accelerate>=0.25.0

API Test

We deploy the space and test the API directly.

	
from gradio_client import Client

client = Client("Maximofn/SmolLM2_localModel")
result = client.predict(
    message="Hola, ¿cómo estás? Me llamo Máximo",
    system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
    api_name="/chat"
)
print(result)

Loaded as API: https://maximofn-smollm2-localmodel.hf.space ✔
Hola Máximo, soy su Chatbot amable y estoy funcionando bien. Gracias por tu mensaje, me complace ayudarte hoy en día. ¿Cómo puedo servirte?

I'm surprised how quickly the model responds even on a server without a GPU.

Deploy backend with FastAPI, Langchain and Docker and model running on the server

Now we do the same as before, but with FastAPI, LangChain and Docker.

Create Space

When creating the space on HuggingFace, we do the same as before: create a new space, give it a name and a description, select Docker as the SDK, choose the hardware on which we are going to deploy it (I select the most basic, free hardware), and decide whether to make it private or public.

Code

app.py

We no longer import InferenceClient and now import AutoModelForCausalLM and AutoTokenizer from the transformers library and import torch.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


We instantiate the model and the tokenizer with AutoModelForCausalLM and AutoTokenizer.

# Initialize the model and tokenizer
print("Loading model and tokenizer...")
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

try:
    # Load the model in BF16 format for better performance and lower memory usage
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if device == "cuda":
        print("Using GPU for the model...")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            low_cpu_mem_usage=True
        )
    else:
        print("Using CPU for the model...")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map={"": device},
            torch_dtype=torch.float32
        )

    print(f"Model successfully loaded on: {device}")
except Exception as e:
    print(f"Error loading the model: {str(e)}")
    raise


We redefine the call_model function to perform inference with the local model.

# Define the function that calls the model
def call_model(state: MessagesState):
    """
    Call the model with the given messages

    Args:
        state: MessagesState

    Returns:
        dict: A dictionary containing the generated text and the thread ID
    """
    # Convert LangChain messages to chat format
    messages = []
    for msg in state["messages"]:
        if isinstance(msg, HumanMessage):
            messages.append({"role": "user", "content": msg.content})
        elif isinstance(msg, AIMessage):
            messages.append({"role": "assistant", "content": msg.content})

    # Prepare the input using the chat template
    input_text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)

    # Generate response
    outputs = model.generate(
        inputs,
        max_new_tokens=512,  # Increase the number of tokens for longer responses
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode and clean the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract only the assistant's response (after the last user message)
    response = response.split("Assistant:")[-1].strip()

    # Convert the response to LangChain format
    ai_message = AIMessage(content=response)

    return {"messages": state["messages"] + [ai_message]}

requirements.txt

We need to remove langchain-huggingface and add transformers, accelerate and torch in the requirements.txt file. The file would look like:

fastapi
uvicorn
requests
pydantic>=2.0.0
langchain>=0.1.0
langchain-core>=0.1.10
langgraph>=0.2.27
python-dotenv>=1.0.0
transformers>=4.36.0
torch>=2.0.0
accelerate>=0.26.0

Dockerfile

We no longer need the RUN --mount=type=secret,id=HUGGINGFACE_TOKEN,mode=0444,required=true line, because the model will run on the server and we won't be making calls to Inference Endpoints, so we don't need the token. The file would look like:

FROM python:3.13-slim

RUN useradd -m -u 1000 user
WORKDIR /app

COPY --chown=user ./requirements.txt requirements.txt
RUN pip install --no-cache-dir --upgrade -r requirements.txt

COPY --chown=user . /app

EXPOSE 7860

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]


API Test

We deploy the space and test the API. In this case, I will test it directly from Python.

	
import requests

url = "https://maximofn-smollm2-backend-localmodel.hf.space/generate"
data = {
    "query": "Hola, ¿cómo estás?",
    "thread_id": "user1"
}
response = requests.post(url, json=data)
if response.status_code == 200:
    result = response.json()
    print("Respuesta:", result["generated_text"])
    print("Thread ID:", result["thread_id"])
else:
    print("Error:", response.status_code, response.text)

Respuesta: system
You are a friendly Chatbot. Always reply in the language in which the user is writing to you.
user
Hola, ¿cómo estás?
assistant
Estoy bien, gracias por preguntar. Estoy muy emocionado de la semana que viene.
Thread ID: user1

This takes a bit longer than the previous one; in reality, it takes the normal time for a model running on a server without a GPU. The odd thing is how fast it was when we deployed it with Gradio. I don't know what HuggingFace does behind the scenes, or maybe it was just a coincidence.

Conclusions

We have seen how to create a backend with an LLM, both by making calls to the HuggingFace Inference Endpoint and by making calls to a model running locally. We have seen how to do this with Gradio or with FastAPI, Langchain, and Docker.

From here you have the knowledge to deploy your own models, even if they are not LLMs; they could be multimodal models, for example. From here you can do whatever you want.
