OpenAI-Compatible API


In this guide, we'll focus on creating a text generation API using Google Gemini as an example, including an optional streaming feature. Although Google Gemini already offers an OpenAI-compatible API, we'll build our own version to demonstrate how to implement a text generation endpoint and integrate streaming capabilities. This guide provides detailed, step-by-step instructions and code examples.

Please note that this example will not be perfectly optimized but will help you understand the process of setting up and deploying a text generation service.


Code Example

Importing Libraries for our Application

This code imports necessary libraries for building our app.

import os
import asyncio
import json
import time
import uuid
from typing import Optional, List
from pydantic import BaseModel
from fastapi import FastAPI, HTTPException
from starlette.responses import StreamingResponse
import google.generativeai as genai

from dotenv import load_dotenv
load_dotenv()

It includes tools for handling asynchronous operations, data validation, web API creation, streaming responses, and interacting with the Gemini model itself. It also loads environment variables, which is where the Gemini API key (GEMINI_API_KEY) is stored.
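If you don't already have one, the .env file is assumed to sit next to the script and to contain a single entry holding your key (the value below is a placeholder):

GEMINI_API_KEY=your-gemini-api-key-here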

Setting up Gemini: Configuration and Parameters

This code configures Google Gemini with your API key and sets generation parameters such as:

  • temperature (controls randomness)

  • top_p and top_k (influence word selection)

  • max_output_tokens (limits response length)

  • response_mime_type (the desired output format, text/plain here)

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model_config = {
    "temperature": 0.1,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 2048,
    "response_mime_type": "text/plain",
}

These settings fine-tune how Gemini generates text, influencing its creativity and output length.
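As an optional sanity check, and assuming your key is loaded and gemini-1.5-flash is available to it, you can instantiate a model with this configuration directly and confirm it responds:

check_model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",  # example model name; use any Gemini model you have access to
    generation_config=model_config,
)
print(check_model.generate_content("Say hello in one short sentence.").text)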

API Structure and Request Model

This code sets up the FastAPI web application and defines the structure of incoming requests.


app = FastAPI(title="Gemini openai-compatible api")


class Message(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    max_tokens: Optional[int] = 1024
    temperature: Optional[float] = 0.3
    messages: List[Message]
    model: Optional[str] = "gemini-1.5-pro-exp-0801"
    stream: Optional[bool] = False

It creates a ChatCompletionRequest model that outlines the expected data (an example request body is sketched just after this list):

  • max_tokens (maximum response length)

  • temperature (controls randomness)

  • model (specifies the Gemini model to use)

  • messages (the conversation history)

  • stream (whether to stream the response)
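To make the request shape concrete, here is a rough example of the JSON body a client could send to the endpoint we define below; any field left out falls back to the defaults above, and the values shown are purely illustrative:

{
    "model": "gemini-1.5-flash",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 256,
    "temperature": 0.3,
    "stream": false
}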

Streaming Response Generator

Next we define a function called _async_resp_generator that handles streaming the model-generated text back to the client. It takes the generated text and the original request as input.

async def _async_resp_generator(text_resp: str, request: ChatCompletionRequest):
    model = genai.GenerativeModel(
        model_name=request.model,
        generation_config={
            **model_config,
            "temperature": request.temperature,
            "max_output_tokens": request.max_tokens,
        },
    )

    # Stream the model output, then split each streamed chunk into words so the
    # client receives small, OpenAI-style "chat.completion.chunk" payloads.
    for chunk in model.generate_content(text_resp, stream=True):
        for i, token in enumerate(chunk.text.split(" ")):
            data = {
                "id": i,
                "object": "chat.completion.chunk",
                "created": time.time(),
                "model": request.model,
                "choices": [{"delta": {"content": f"{token} "}}],
            }
            yield f"data: {json.dumps(data)}\n\n"
            await asyncio.sleep(0.01)  # small delay to simulate a realistic stream
    yield "data: [DONE]\n\n"

It then iterates through the text, breaking it into smaller chunks (words in this case), and yields each chunk as a JSON object formatted for streaming.

This allows the client to receive and display the text progressively, creating a more interactive and dynamic experience. The function also includes a small delay to simulate a more realistic streaming scenario.
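For reference, each yielded line is a server-sent-events-style string; a single chunk would look roughly like this (the timestamp and token are illustrative):

data: {"id": 0, "object": "chat.completion.chunk", "created": 1700000000.0, "model": "gemini-1.5-pro-exp-0801", "choices": [{"delta": {"content": "Hello "}}]}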

Handling Chat Completion Requests

Here we define the main endpoint for handling chat completion requests. It first checks whether any messages were provided in the request and raises an error if not; it then extracts the last message from the conversation history and uses the specified Gemini model to generate a response.

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if not request.messages:
        raise HTTPException(status_code=400, detail="No messages provided.")

    last_message = request.messages[-1].content
    model = genai.GenerativeModel(
        model_name=request.model,
        generation_config={
            **model_config,
            "temperature": request.temperature,
            "max_output_tokens": request.max_tokens,
        }
    )
    resp_content = model.generate_content(last_message).text

    if request.stream:
        return StreamingResponse(
            _async_resp_generator(resp_content, request), media_type="application/x-ndjson"
        )

    return {
        "id": uuid.uuid4(),
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{"message": Message(role="assistant", content=resp_content)}],
    }

The key part here is the handling of the stream parameter. If stream is set to True, the endpoint uses the _async_resp_generator function we discussed earlier to stream the response back to the client in chunks; if stream is False, it returns the entire response as a single JSON object. This provides flexibility for different use cases, allowing both immediate and streamed responses.
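Putting it together, a non-streaming call returns a JSON document shaped roughly like this (the id, timestamp, and answer text are illustrative):

{
    "id": "3f0e2c9a-0000-0000-0000-000000000000",
    "object": "chat.completion",
    "created": 1700000000.0,
    "model": "gemini-1.5-flash",
    "choices": [
        {"message": {"role": "assistant", "content": "...the generated answer..."}}
    ]
}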

Running the Application

This code is what actually starts the FastAPI application. It uses uvicorn, a popular ASGI (Asynchronous Server Gateway Interface) server, to run the app.

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        app,
        host="0.0.0.0",
        port=8000,
    )

The host="0.0.0.0" setting makes the application accessible from any network interface, and port=8000 specifies the port to listen on. When you run this code, your API will be reachable locally at http://localhost:8000, ready to accept chat completion requests and return responses from the Google Gemini model.

This is the command that brings your entire API to life:

python app.py
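Equivalently, assuming the file is saved as app.py, you can launch it with uvicorn's command-line interface (adding --reload is handy during development):

uvicorn app:app --host 0.0.0.0 --port 8000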

Testing the API

Without Streaming:

This code demonstrates how to use the API you've built (without the streaming feature).

from openai import OpenAI

client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  
)

res = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "is 1 + 1 equal to 2?"}],
    stream=False,
)

print(res.choices[0].message.content)

It utilizes the openai library, which is commonly used for interacting with OpenAI's API, but in this case, it's configured to point to your local API running at http://localhost:8000.

We create a chat completion request, asking a question and specifying the gemini-1.5-flash model. Importantly, stream is set to False, indicating that the entire response should be returned at once.

With Streaming:

This code snippet demonstrates how to use the API with the streaming functionality enabled. Like the previous example, it uses the openai library and points to your local API, but this time stream is set to True.

This means that instead of receiving the entire response at once, the client receives chunks of the response as they are generated. The code iterates through these chunks and prints each one without a newline, effectively producing a live stream of the AI's response. This is particularly useful for longer responses, as it provides immediate feedback to the user and creates a more engaging experience.

from openai import OpenAI
import sys

client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  
)

res = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)

for chunk in res:
    sys.stdout.write(chunk.choices[0].delta.content or "")
    sys.stdout.flush()

In this specific example we ask for a joke, and the response is streamed back piece by piece, building suspense until the punchline is finally revealed. This illustrates the power of streaming for creating more dynamic and interactive AI applications.


Thanks for following along! We've walked through the creation of a basic text generation API using Google Gemini. While this provides a solid foundation, remember that this implementation, particularly the streaming response generator, isn't fully optimized; there's always room for improvement in efficiency and performance. Hopefully, this guide has given you a clearer understanding of how to build and interact with a text generation API, with and without streaming capabilities.

Happy prompting!