LLM QA

A proof of concept question-answering system for different types of text data.

Currently implemented:

  • Plain text
  • Markdown

Key Features

Dockerized development environment

  • Easy, quick and reproducible setup

Automatic pull and serve of declared models

  • Ollama models are automatically pulled and served by the FastAPI server
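
A minimal sketch of how such an auto-pull check can be done against the Ollama HTTP API with httpx; the project's actual logic lives in llm_qa.dependencies and may differ:

```python
import httpx

OLLAMA_URL = "http://ollama:11434"  # service URL as seen in the logs below


def ensure_model(model: str) -> None:
    """Pull an Ollama model only if it is not already available locally."""
    with httpx.Client(base_url=OLLAMA_URL, timeout=None) as client:
        # /api/show returns 200 if the model exists locally, 404 otherwise
        if client.post("/api/show", json={"name": model}).status_code == 200:
            return  # model already present, nothing to do
        # /api/pull downloads the model; this can take a while for large models
        client.post("/api/pull", json={"name": model, "stream": False}).raise_for_status()


ensure_model("openchat:7b-v3.5-0106-q4_K_M")
```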

Detailed logging

  • Key potential bottlenecks are timed and logged
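
The time_logger lines below could be produced by a small timing wrapper along these lines; this is an illustrative sketch, not the actual llm_qa.chains.time_logger implementation:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("llm_qa.chains.time_logger")


@contextmanager
def log_duration(name: str):
    """Log how long the wrapped block took, mirroring the log lines below."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("Chain `%s` finished in %.2f seconds", name, elapsed)


# Example: wrap any potential bottleneck
with log_duration("VectorStoreRetriever"):
    ...  # retrieval call goes here
```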

Upsert

2024-02-15 01:10:54,341 - llm_qa.services.upsert - INFO - Split `MARKDOWN` type text into 8 document chunks in 0.01 seconds
2024-02-15 01:10:54,759 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:11:03,121 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:11:03,140 - llm_qa.services.upsert - INFO - Upserted 8 document chunks to Qdrant collection `showcase` in 8.80 seconds
2024-02-15 01:11:03,142 - uvicorn.access - INFO - 127.0.0.1:55868 - "POST /api/v1/upsert-text HTTP/1.1" 200 OK
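
For reference, a request like the one logged above can be sent with httpx. The field names (text, text_type, collection) and the host/port are assumptions for illustration; the actual schema is defined by the llm_qa request models and visible on the FastAPI /docs page:

```python
import httpx

# Hypothetical request body; check the FastAPI /docs page for the real schema.
payload = {
    "text": open("example.md", encoding="utf-8").read(),
    "text_type": "MARKDOWN",
    "collection": "showcase",
}
response = httpx.post("http://127.0.0.1:8000/api/v1/upsert-text", json=payload, timeout=60)
response.raise_for_status()
```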

Chat

2024-02-15 01:02:03,408 - llm_qa.dependencies - INFO - Ollama auto-pull enabled, checking if model is available
2024-02-15 01:02:03,441 - httpx - INFO - HTTP Request: POST http://ollama:11434/api/show "HTTP/1.1 200 OK"
2024-02-15 01:02:03,441 - llm_qa.dependencies - INFO - Ollama model `openchat:7b-v3.5-0106-q4_K_M` already exists
2024-02-15 01:02:03,645 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:02:03,653 - llm_qa.chains.time_logger - INFO - Chain `VectorStoreRetriever` finished in 0.08 seconds
2024-02-15 01:02:23,192 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference-rerank/rerank "HTTP/1.1 200 OK"
2024-02-15 01:02:23,194 - llm_qa.chains.time_logger - INFO - Chain `RerankAndTake` finished in 19.54 seconds
2024-02-15 01:02:29,817 - llm_qa.chains.time_logger - INFO - Chain `ChatOllama` finished in 6.62 seconds
2024-02-15 01:02:29,817 - llm_qa.services.chat - INFO - Chat chain finished in 26.27 seconds
2024-02-15 01:02:29,823 - uvicorn.access - INFO - 127.0.0.1:50100 - "POST /api/v1/chat HTTP/1.1" 200 OK
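
A chat request could look roughly like this; again, the field names and host/port are assumptions, not the definitive schema:

```python
import httpx

# Hypothetical request body; the real schema is defined by the llm_qa FastAPI models.
payload = {
    "message": "What are all AWS regions where SageMaker is available?",
    "collection": "showcase",
}
response = httpx.post("http://127.0.0.1:8000/api/v1/chat", json=payload, timeout=120)
print(response.json())
```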

Hierarchical document chunking

  • Hierarchical text, such as Markdown, is split into document chunks by headers
  • All previous parent headers are also included in the chunk, separated by ...
  • This enriches the context of each chunk and prevents global context from being lost when the text is split (a sketch of the idea follows the example below)

Example:

# AWS::SageMaker::ModelQualityJobDefinition MonitoringGroundTruthS3Input<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input"></a>
...
## Syntax<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-syntax"></a>
...
### YAML<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-syntax.yaml"></a>
```
[S3Uri](#cfn-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-s3uri): String
```
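
A simplified sketch of the chunking idea shown above (not the project's actual splitter), where each chunk is prefixed with its parent headers and "..." marks the elided parent content:

```python
def chunk_markdown(text: str) -> list[str]:
    """Split Markdown by headers, prefixing each chunk with its parent headers."""
    chunks: list[str] = []
    path: list[str] = []  # header path from the document root to the current header
    body: list[str] = []  # lines under the current (deepest) header

    def flush() -> None:
        if path and any(line.strip() for line in body):
            *parents, current = path
            parts = [f"{header}\n..." for header in parents] + [current, "\n".join(body)]
            chunks.append("\n".join(parts))
        body.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]  # drop headers at this level or deeper
            path.append(line)
        else:
            body.append(line)
    flush()
    return chunks
```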

Retrieval query rewriting

  • After the first message, subsequent messages are rewritten to include the context of previous messages
  • This allows for a more natural conversation flow and the retrieval of more relevant chunks (a sketch of the rewriting step follows the example below)

Example:

### User: What are all AWS regions where SageMaker is available?
### AI:  SageMaker is available in most AWS regions, except for the following: Asia Pacific (Jakarta), Africa (Cape Town), Middle East (UAE), Asia Pacific (Hyderabad), Asia Pacific (Osaka), Asia Pacific (Melbourne), Europe (Milan), AWS GovCloud (US-East), Europe (Spain), and Europe (Zurich) Region.

### User: What about the Bedrock service?
### Retrieval Query:  What is the availability of AWS SageMaker in relation to the Bedrock service?
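
A minimal sketch of the rewriting step, assuming a LangChain prompt piped into ChatOllama; the prompt wording and chain structure here are illustrative, not the project's exact chain:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Hypothetical prompt; the wording used by llm_qa may differ.
rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Rewrite the user's latest message as a standalone retrieval query, "
     "filling in missing context from the conversation history."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{message}"),
])

model = ChatOllama(base_url="http://ollama:11434", model="openchat:7b-v3.5-0106-q4_K_M")
rewrite_chain = rewrite_prompt | model

retrieval_query = rewrite_chain.invoke({
    "history": [
        HumanMessage(content="What are all AWS regions where SageMaker is available?"),
        AIMessage(content="SageMaker is available in most AWS regions, except ..."),
    ],
    "message": "What about the Bedrock service?",
}).content
```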

Reranking

  • Retrieval of a larger number of document chunks is first performed using a vector store
  • Then, the chunks are reranked using a reranker model
  • This process more precisely selects the most relevant chunks for the user query
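
A minimal sketch of the rerank step, calling the text-embeddings-inference /rerank endpoint directly with httpx; the request and response shapes follow TEI's rerank API, while the project wraps this inside the RerankAndTake chain:

```python
import httpx

RERANK_URL = "http://text-embeddings-inference-rerank"  # service name seen in the logs above


def rerank_and_take(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    """Rerank retrieved chunks with the TEI reranker and keep the top_k most relevant."""
    response = httpx.post(
        f"{RERANK_URL}/rerank",
        json={"query": query, "texts": chunks},
        timeout=60,
    )
    response.raise_for_status()
    # TEI returns one {"index", "score"} entry per input text, sorted by score descending
    ranked = response.json()
    return [chunks[item["index"]] for item in ranked[:top_k]]
```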

Development

Non-Nvidia

If you don't have an Nvidia GPU, remove the nvidia resource from the ollama service in the compose.yaml file.
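
In a typical setup the reservation looks roughly like this (the exact block in compose.yaml may differ); removing it lets the ollama service run on the CPU:

```yaml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```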

Setup

First copy the .devcontainer/.env.example file to .devcontainer/.env and adjust the settings and models to your needs.

Then open the project devcontainer in a compatible IDE. This will set up all required tools and project dependencies for Python development, and will also start Docker containers for all required services.

Configuration

Create an llm-qa/.env file to selectively override the default environment variables defined in llm-qa/.env.default.

Running

To run the FastAPI server, run the llm_qa.web submodule:

poetry run python -m llm_qa.web

To run the minimal CLI client, run the llm_qa.client submodule:

poetry run python -m llm_qa.client

Deployment

Not yet implemented.