# LLM QA

A proof-of-concept question-answering system for different types of text data.
Currently implemented:
- Plain text
- Markdown
## Key Features

### Dockerized development environment

- Easy, quick and reproducible setup

### Automatic pull and serve of declared models

- Ollama models are automatically pulled and served by the FastAPI server
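On startup, the server checks whether each declared model is already available in Ollama and pulls it if not (see the `llm_qa.dependencies` log lines below). A minimal sketch of that check, assuming the standard Ollama HTTP API endpoints `/api/show` and `/api/pull`; this is illustrative, not the project's exact code:

```python
import httpx

OLLAMA_BASE_URL = "http://ollama:11434"  # matches the URL seen in the logs below


def ensure_ollama_model(model: str) -> None:
    """Pull the model into the Ollama server if it is not available yet."""
    # /api/show returns 200 if the model already exists locally, 404 otherwise
    response = httpx.post(f"{OLLAMA_BASE_URL}/api/show", json={"name": model})
    if response.status_code == 200:
        return  # model already present, nothing to do
    # /api/pull downloads the model; stream=False waits until the pull completes
    httpx.post(
        f"{OLLAMA_BASE_URL}/api/pull",
        json={"name": model, "stream": False},
        timeout=None,  # pulling a multi-GB model can take a while
    ).raise_for_status()


ensure_ollama_model("openchat:7b-v3.5-0106-q4_K_M")
```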
### Detailed logging

- Key potential bottlenecks are timed and logged
#### Upsert

```
2024-02-15 01:10:54,341 - llm_qa.services.upsert - INFO - Split `MARKDOWN` type text into 8 document chunks in 0.01 seconds
2024-02-15 01:10:54,759 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:11:03,121 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:11:03,140 - llm_qa.services.upsert - INFO - Upserted 8 document chunks to Qdrant collection `showcase` in 8.80 seconds
2024-02-15 01:11:03,142 - uvicorn.access - INFO - 127.0.0.1:55868 - "POST /api/v1/upsert-text HTTP/1.1" 200 OK
```
#### Chat

```
2024-02-15 01:02:03,408 - llm_qa.dependencies - INFO - Ollama auto-pull enabled, checking if model is available
2024-02-15 01:02:03,441 - httpx - INFO - HTTP Request: POST http://ollama:11434/api/show "HTTP/1.1 200 OK"
2024-02-15 01:02:03,441 - llm_qa.dependencies - INFO - Ollama model `openchat:7b-v3.5-0106-q4_K_M` already exists
2024-02-15 01:02:03,645 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:02:03,653 - llm_qa.chains.time_logger - INFO - Chain `VectorStoreRetriever` finished in 0.08 seconds
2024-02-15 01:02:23,192 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference-rerank/rerank "HTTP/1.1 200 OK"
2024-02-15 01:02:23,194 - llm_qa.chains.time_logger - INFO - Chain `RerankAndTake` finished in 19.54 seconds
2024-02-15 01:02:29,817 - llm_qa.chains.time_logger - INFO - Chain `ChatOllama` finished in 6.62 seconds
2024-02-15 01:02:29,817 - llm_qa.services.chat - INFO - Chat chain finished in 26.27 seconds
2024-02-15 01:02:29,823 - uvicorn.access - INFO - 127.0.0.1:50100 - "POST /api/v1/chat HTTP/1.1" 200 OK
```
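The `llm_qa.chains.time_logger` entries above come from wrapping each chain with a timer. The project's wrapper is chain-specific; a minimal, framework-agnostic sketch of the same idea (names are illustrative):

```python
import logging
import time
from collections.abc import Callable
from typing import TypeVar

logger = logging.getLogger("llm_qa.chains.time_logger")
T = TypeVar("T")


def run_timed(name: str, func: Callable[..., T], *args, **kwargs) -> T:
    """Run `func` and log how long the named step took, mirroring the log format above."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    logger.info("Chain `%s` finished in %.2f seconds", name, time.perf_counter() - start)
    return result
```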
### Hierarchical document chunking

- Hierarchical text, such as Markdown, is split into document chunks by headers
- All previous parent headers are also included in the chunk, separated by `...`
- This enriches the context of the chunk and solves the problem of global context being lost when splitting the text
Example:
````markdown
# AWS::SageMaker::ModelQualityJobDefinition MonitoringGroundTruthS3Input<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input"></a>
...
## Syntax<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-syntax"></a>
...
### YAML<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-syntax.yaml"></a>
``` [S3Uri](#cfn-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-s3uri): String ```
````
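The splitter that produces chunks like the one above lives inside `llm_qa`; the following is a simplified, dependency-free sketch of the described behaviour (split on Markdown headers, prepend parent headers separated by `...` lines), not the project's actual implementation:

```python
def split_markdown(text: str) -> list[str]:
    """Split Markdown by headers; each chunk keeps its parent headers, separated by `...`."""
    chunks: list[str] = []
    headers: list[str] = []  # currently "open" headers, shallowest first
    body: list[str] = []

    def flush() -> None:
        if not body:
            return
        parts = headers[:-1]                          # parent headers
        parts.append("\n".join(headers[-1:] + body))  # current header plus its content
        chunks.append("\n...\n".join(parts))          # `...` lines mark elided parent content
        body.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            level = line.split(" ", 1)[0].count("#")
            headers[level - 1:] = [line]              # close sibling/deeper headers, keep ancestors
        else:
            body.append(line)
    flush()
    return chunks
```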
### Retrieval query rewriting

- After the first message, subsequent messages are rewritten to include context from previous messages
- This allows for a more natural conversation flow and retrieval of more relevant chunks
Example:
```
### User: What are all AWS regions where SageMaker is available?
### AI: SageMaker is available in most AWS regions, except for the following: Asia Pacific (Jakarta), Africa (Cape Town), Middle East (UAE), Asia Pacific (Hyderabad), Asia Pacific (Osaka), Asia Pacific (Melbourne), Europe (Milan), AWS GovCloud (US-East), Europe (Spain), and Europe (Zurich) Region.
### User: What about the Bedrock service?
### Retrieval Query: What is the availability of AWS SageMaker in relation to the Bedrock service?
```
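Rewriting of this kind is typically an extra LLM call that condenses the chat history and the latest message into a standalone retrieval query. A sketch using LangChain with the same ChatOllama model seen in the logs; the prompt wording is illustrative and not the project's exact rewrite prompt:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Illustrative prompt; the project's actual rewrite prompt may differ
prompt = ChatPromptTemplate.from_messages([
    ("system", "Rewrite the user's latest message as a standalone search query, "
               "using the conversation history to fill in missing context."),
    ("human", "History:\n{history}\n\nLatest message: {message}\n\nStandalone query:"),
])

llm = ChatOllama(model="openchat:7b-v3.5-0106-q4_K_M", base_url="http://ollama:11434")
rewrite_chain = prompt | llm | StrOutputParser()

query = rewrite_chain.invoke({
    "history": "User: What are all AWS regions where SageMaker is available?\nAI: ...",
    "message": "What about the Bedrock service?",
})
```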
### Reranking

- A larger number of document chunks is first retrieved from the vector store
- The chunks are then reranked using a reranker model
- This selects the chunks most relevant to the user query more precisely than vector similarity alone
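A minimal sketch of the rerank-and-take step, assuming the text-embeddings-inference `/rerank` endpoint seen in the logs (it accepts a query plus candidate texts and returns index/score pairs); the candidate list would come from the vector store retrieval, and the URL and `top_n` value are illustrative:

```python
import httpx

RERANK_URL = "http://text-embeddings-inference-rerank/rerank"  # matches the logs above


def rerank_and_take(query: str, candidate_chunks: list[str], top_n: int = 4) -> list[str]:
    """Rerank vector-store candidates with the reranker model and keep the best `top_n`."""
    response = httpx.post(
        RERANK_URL,
        json={"query": query, "texts": candidate_chunks},
        timeout=60.0,
    )
    response.raise_for_status()
    # One {"index": ..., "score": ...} entry per candidate; sort by score and keep the top_n
    ranked = sorted(response.json(), key=lambda item: item["score"], reverse=True)
    return [candidate_chunks[item["index"]] for item in ranked[:top_n]]
```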
## Development

### Non-Nvidia

If you don't have an Nvidia GPU, remove the `nvidia` resource from the `ollama` service in the `compose.yaml` file.
### Setup

First, copy the `.devcontainer/.env.example` file to `.devcontainer/.env` and adjust the settings and models to your needs.

Then simply open the project devcontainer in a compatible IDE. This will set up all required tools and project dependencies for Python development. It will also run Docker containers for all required services.
### Configuration

Create a `llm-qa/.env` file to selectively override the default environment variables located in `llm-qa/.env.default`.
### Running

To run the FastAPI server, run the `llm_qa.web` module:

```sh
poetry run python -m llm_qa.web
```

To run the minimal CLI client, run the `llm_qa.client` module:

```sh
poetry run python -m llm_qa.client
```
## Deployment
Not yet implemented.