Step 03 - Streaming responses

LLM responses can be long. Imagine asking the model to generate a story. It could potentially produce hundreds of lines of text.

In the current application, the entire response is accumulated before being sent to the client. While the model generates, the client waits for the response and the server waits for the model to finish. Sure, there is the “…” bubble indicating that something is happening, but it is not the best user experience.

Streaming allows us to send the response as it is generated: the model emits the response in chunks (tokens), and the server forwards these chunks to the client as they arrive.

The final code of this step is located in the step-03 directory. However, we recommend you follow the instructions below to get there, continuing to extend your current application.

Asking the LLM to return chunks

The first step is to ask the LLM to return the response in chunks. Initially, our AI service looked like this:

CustomerSupportAgent.java
package dev.langchain4j.quarkus.workshop;

import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.SessionScoped;

@SessionScoped
@RegisterAiService
public interface CustomerSupportAgent {

    String chat(String userMessage);
}

Note that the return type of the chat method is String. We will change it to Multi<String> to indicate that the response will be streamed instead of returned synchronously.

CustomerSupportAgent.java
package dev.langchain4j.quarkus.workshop;

import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;
import jakarta.enterprise.context.SessionScoped;

@SessionScoped
@RegisterAiService
public interface CustomerSupportAgent {

    Multi<String> chat(String userMessage);
}

A Multi<String> is a stream of strings. Multi is a type from the Mutiny library that represents a stream of items, possibly infinite. In this case, it will be a stream of strings representing the response from the LLM, and it will be finite (fortunately). A Multi has other characteristics, such as the ability to handle back pressure, which we will not cover in this workshop.
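
To make this more concrete, here is a small standalone sketch (not something you need to add to the project; the class name MultiSketch is purely illustrative) that creates a finite Multi of string chunks and prints them as they are emitted:

MultiSketch.java
package dev.langchain4j.quarkus.workshop;

import io.smallrye.mutiny.Multi;

public class MultiSketch {

    public static void main(String[] args) {
        // A finite stream of string chunks, similar in spirit to the token
        // chunks the LLM will emit.
        Multi<String> chunks = Multi.createFrom().items("Once", " upon", " a", " time");

        // Each chunk is pushed to the subscriber as soon as it is available.
        chunks.subscribe().with(
                System.out::print,                                          // called for every chunk
                failure -> System.err.println("Stream failed: " + failure), // called if the stream fails
                () -> System.out.println("\n<stream completed>"));          // called when the stream completes
    }
}

Running it prints the chunks one by one, followed by the completion marker: the same shape of interaction we want between the LLM, the server, and the client.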

Serving streams from the websocket

Ok, now our AI service returns a stream of strings. But we still need to modify our websocket endpoint to handle this stream and send it to the client.

Currently, our websocket endpoint looks like this:

CustomerSupportAgentWebSocket.java
package dev.langchain4j.quarkus.workshop;

import io.quarkus.websockets.next.OnOpen;
import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;

@WebSocket(path = "/customer-support-agent")
public class CustomerSupportAgentWebSocket {

    private final CustomerSupportAgent customerSupportAgent;

    public CustomerSupportAgentWebSocket(CustomerSupportAgent customerSupportAgent) {
        this.customerSupportAgent = customerSupportAgent;
    }

    @OnOpen
    public String onOpen() {
        return "Welcome to Miles of Smiles! How can I help you today?";
    }

    @OnTextMessage
    public String onTextMessage(String message) {
        return customerSupportAgent.chat(message);
    }
}

Let’s modify the onTextMessage method to send the response to the client as it arrives.

CustomerSupportAgentWebSocket.java
package dev.langchain4j.quarkus.workshop;

import io.quarkus.websockets.next.OnOpen;
import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;
import io.smallrye.mutiny.Multi;

@WebSocket(path = "/customer-support-agent")
public class CustomerSupportAgentWebSocket {

    private final CustomerSupportAgent customerSupportAgent;

    public CustomerSupportAgentWebSocket(CustomerSupportAgent customerSupportAgent) {
        this.customerSupportAgent = customerSupportAgent;
    }

    @OnOpen
    public String onOpen() {
        return "Welcome to Miles of Smiles! How can I help you today?";
    }

    @OnTextMessage
    public Multi<String> onTextMessage(String message) {
        return customerSupportAgent.chat(message);
    }
}

That’s it! The response will now be streamed to the client as it arrives. Quarkus natively understands the Multi return type and knows how to handle it.
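
If you want to observe the chunks on the server side as they flow to the client, one possible tweak (a sketch, not required by the workshop) is to peek at each item with Mutiny’s invoke operator, here logging with io.quarkus.logging.Log:

    @OnTextMessage
    public Multi<String> onTextMessage(String message) {
        // invoke() observes each item without changing the stream;
        // requires: import io.quarkus.logging.Log;
        return customerSupportAgent.chat(message)
                .invoke(chunk -> Log.debugf("Streaming chunk: %s", chunk));
    }

The rest of the class stays the same; each chunk is simply logged before being sent to the client.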

Testing the streaming

To test the streaming, you can use the same chat interface as before. The application should still be running. Go back to the browser, refresh the page, and start chatting. If you ask simple questions, you may not notice the difference.

Ask something like

Tell me a story containing 500 words

and you will see the response being displayed as it arrives.

Let’s now switch to the next step!