# LLM QA

A proof-of-concept question-answering system for different types of text data.
Currently implemented:
- Plain text
- Markdown
## Key Features

### Dockerized development environment

- Easy, quick and reproducible setup

### Automatic pull and serve of declared models

- Ollama models are automatically pulled and served by the FastAPI server
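On startup, the server checks whether each declared model is already available in Ollama and pulls it if not (see the `llm_qa.dependencies` log lines below). A minimal sketch of that check, assuming the standard Ollama HTTP API endpoints `/api/show` and `/api/pull`; this is illustrative, not the project's exact code:

```python
import httpx

OLLAMA_BASE_URL = "http://ollama:11434"  # matches the URL seen in the logs below


def ensure_ollama_model(model: str) -> None:
    """Pull the model into the Ollama server if it is not available yet."""
    # /api/show returns 200 if the model already exists locally, 404 otherwise
    response = httpx.post(f"{OLLAMA_BASE_URL}/api/show", json={"name": model})
    if response.status_code == 200:
        return  # model already present, nothing to do
    # /api/pull downloads the model; stream=False waits until the pull completes
    httpx.post(
        f"{OLLAMA_BASE_URL}/api/pull",
        json={"name": model, "stream": False},
        timeout=None,  # pulling a multi-GB model can take a while
    ).raise_for_status()


ensure_ollama_model("openchat:7b-v3.5-0106-q4_K_M")
```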
### Detailed logging

- Key potential bottlenecks are timed and logged
#### Upsert

```
2024-02-15 01:10:54,341 - llm_qa.services.upsert - INFO - Split `MARKDOWN` type text into 8 document chunks in 0.01 seconds
2024-02-15 01:10:54,759 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:11:03,121 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:11:03,140 - llm_qa.services.upsert - INFO - Upserted 8 document chunks to Qdrant collection `showcase` in 8.80 seconds
2024-02-15 01:11:03,142 - uvicorn.access - INFO - 127.0.0.1:55868 - "POST /api/v1/upsert-text HTTP/1.1" 200 OK
```
#### Chat

```
2024-02-15 01:02:03,408 - llm_qa.dependencies - INFO - Ollama auto-pull enabled, checking if model is available
2024-02-15 01:02:03,441 - httpx - INFO - HTTP Request: POST http://ollama:11434/api/show "HTTP/1.1 200 OK"
2024-02-15 01:02:03,441 - llm_qa.dependencies - INFO - Ollama model `openchat:7b-v3.5-0106-q4_K_M` already exists
2024-02-15 01:02:03,645 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference/embed "HTTP/1.1 200 OK"
2024-02-15 01:02:03,653 - llm_qa.chains.time_logger - INFO - Chain `VectorStoreRetriever` finished in 0.08 seconds
2024-02-15 01:02:23,192 - httpx - INFO - HTTP Request: POST http://text-embeddings-inference-rerank/rerank "HTTP/1.1 200 OK"
2024-02-15 01:02:23,194 - llm_qa.chains.time_logger - INFO - Chain `RerankAndTake` finished in 19.54 seconds
2024-02-15 01:02:29,817 - llm_qa.chains.time_logger - INFO - Chain `ChatOllama` finished in 6.62 seconds
2024-02-15 01:02:29,817 - llm_qa.services.chat - INFO - Chat chain finished in 26.27 seconds
2024-02-15 01:02:29,823 - uvicorn.access - INFO - 127.0.0.1:50100 - "POST /api/v1/chat HTTP/1.1" 200 OK
```
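The `llm_qa.chains.time_logger` entries above come from wrapping each chain with a timer. The project's wrapper is chain-specific; a minimal, framework-agnostic sketch of the same idea (names are illustrative):

```python
import logging
import time
from collections.abc import Callable
from typing import TypeVar

logger = logging.getLogger("llm_qa.chains.time_logger")
T = TypeVar("T")


def run_timed(name: str, func: Callable[..., T], *args, **kwargs) -> T:
    """Run `func` and log how long the named step took, mirroring the log format above."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    logger.info("Chain `%s` finished in %.2f seconds", name, time.perf_counter() - start)
    return result
```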
### Hierarchical document chunking

- Hierarchical text, such as Markdown, is split into document chunks by headers
- All previous parent headers are also included in the chunk, separated by `...`
- This enriches the context of the chunk and solves the problem of global context being lost when splitting the text
Example:
````markdown
# AWS::SageMaker::ModelQualityJobDefinition MonitoringGroundTruthS3Input<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input"></a>
...
## Syntax<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-syntax"></a>
...
### YAML<a name="aws-properties-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-syntax.yaml"></a>
``` [S3Uri](#cfn-sagemaker-modelqualityjobdefinition-monitoringgroundtruths3input-s3uri): String ```
````
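The splitter that produces chunks like the one above lives inside `llm_qa`; the following is a simplified, dependency-free sketch of the described behaviour (split on Markdown headers, prepend parent headers separated by `...` lines), not the project's actual implementation:

```python
def split_markdown(text: str) -> list[str]:
    """Split Markdown by headers; each chunk keeps its parent headers, separated by `...`."""
    chunks: list[str] = []
    headers: list[str] = []  # currently "open" headers, shallowest first
    body: list[str] = []

    def flush() -> None:
        if not body:
            return
        parts = headers[:-1]                          # parent headers
        parts.append("\n".join(headers[-1:] + body))  # current header plus its content
        chunks.append("\n...\n".join(parts))          # `...` lines mark elided parent content
        body.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            level = line.split(" ", 1)[0].count("#")
            headers[level - 1:] = [line]              # close sibling/deeper headers, keep ancestors
        else:
            body.append(line)
    flush()
    return chunks
```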
### Retrieval query rewriting

- After the first message, subsequent messages are rewritten to include context from previous messages
- This allows for a more natural conversation flow and retrieval of more relevant chunks
Example:
```
### User: What are all AWS regions where SageMaker is available?
### AI: SageMaker is available in most AWS regions, except for the following: Asia Pacific (Jakarta), Africa (Cape Town), Middle East (UAE), Asia Pacific (Hyderabad), Asia Pacific (Osaka), Asia Pacific (Melbourne), Europe (Milan), AWS GovCloud (US-East), Europe (Spain), and Europe (Zurich) Region.
### User: What about the Bedrock service?
### Retrieval Query: What is the availability of AWS SageMaker in relation to the Bedrock service?
```
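Rewriting of this kind is typically an extra LLM call that condenses the chat history and the latest message into a standalone retrieval query. A sketch using LangChain with the same ChatOllama model seen in the logs; the prompt wording is illustrative and not the project's exact rewrite prompt:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Illustrative prompt; the project's actual rewrite prompt may differ
prompt = ChatPromptTemplate.from_messages([
    ("system", "Rewrite the user's latest message as a standalone search query, "
               "using the conversation history to fill in missing context."),
    ("human", "History:\n{history}\n\nLatest message: {message}\n\nStandalone query:"),
])

llm = ChatOllama(model="openchat:7b-v3.5-0106-q4_K_M", base_url="http://ollama:11434")
rewrite_chain = prompt | llm | StrOutputParser()

query = rewrite_chain.invoke({
    "history": "User: What are all AWS regions where SageMaker is available?\nAI: ...",
    "message": "What about the Bedrock service?",
})
```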
### Reranking

- A larger number of document chunks is first retrieved from the vector store
- The chunks are then reranked using a reranker model
- This selects the chunks most relevant to the user query more precisely than vector similarity alone
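A minimal sketch of the rerank-and-take step, assuming the text-embeddings-inference `/rerank` endpoint seen in the logs (it accepts a query plus candidate texts and returns index/score pairs); the candidate list would come from the vector store retrieval, and the URL and `top_n` value are illustrative:

```python
import httpx

RERANK_URL = "http://text-embeddings-inference-rerank/rerank"  # matches the logs above


def rerank_and_take(query: str, candidate_chunks: list[str], top_n: int = 4) -> list[str]:
    """Rerank vector-store candidates with the reranker model and keep the best `top_n`."""
    response = httpx.post(
        RERANK_URL,
        json={"query": query, "texts": candidate_chunks},
        timeout=60.0,
    )
    response.raise_for_status()
    # One {"index": ..., "score": ...} entry per candidate; sort by score and keep the top_n
    ranked = sorted(response.json(), key=lambda item: item["score"], reverse=True)
    return [candidate_chunks[item["index"]] for item in ranked[:top_n]]
```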
## Development

### Non-Nvidia

If you don't have an Nvidia GPU, remove the `nvidia` resource from the `ollama` service in the `compose.yaml` file.
### Setup

First, copy the `.devcontainer/.env.example` file to `.devcontainer/.env` and adjust the settings and models to your needs.

Then simply open the project devcontainer in a compatible IDE. This will set up all required tools and project dependencies for Python development. It will also run Docker containers for all required services.
### Configuration

Create a `llm-qa/.env` file to selectively override the default environment variables located in `llm-qa/.env.default`.
### Running

To run the FastAPI server, run the `llm_qa.web` module:

```sh
poetry run python -m llm_qa.web
```

To run the minimal CLI client, run the `llm_qa.client` module:

```sh
poetry run python -m llm_qa.client
```
## Deployment
Not yet implemented.