Kapampangan2English
The Overview
This project addresses the limited linguistic support for Kapampangan in mainstream Large Language Models (LLMs). By combining a customized Small Language Model (SLM) with a Retrieval-Augmented Generation (RAG) pipeline, the system provides precise translations grounded in verified dictionary data. The application is served via a FastAPI backend and consumed through an interactive Streamlit web interface.
Phase 1: Fine-Tuning the Pre-Project
Before building the RAG application, the core translation intelligence had to be developed. This involved fine-tuning a base model on specific Kapampangan-English datasets.
- Model & Dataset: The foundation is Qwen3-1.7B, an efficient Small Language Model. It was fine-tuned on the Coco-18 Kapampangan-English dataset.
- Unsloth & Quantization: To make fine-tuning accessible and efficient, the model was quantized to 4-bit precision using Unsloth and trained using QLoRA. This substantially reduced VRAM requirements while preserving translation quality.
- Benchmarking: Performance was evaluated by comparing the BLEU and chrF scores of the raw model against those of the fine-tuned version to confirm a quantitative improvement.
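The benchmarking step can be illustrated with a minimal, standard-library sketch of a chrF-style score. This is a simplified character n-gram F-score, not the reference chrF implementation (a real evaluation would more likely use a library such as sacreBLEU), and the function names and example sentences are invented for illustration; they are not the project's data or results.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean char n-gram precision and recall combined as F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# Invented example strings, not taken from the project's dataset or results.
reference = "good morning to all of you"
base_output = "good day everyone"
finetuned_output = "good morning to you all"

base_score = simple_chrf(base_output, reference)
finetuned_score = simple_chrf(finetuned_output, reference)
print(f"base: {base_score:.1f}  fine-tuned: {finetuned_score:.1f}")
```

Scoring both models against the same references, averaged over a held-out test set, is what makes the before/after comparison meaningful.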
Phase 2: RAG Pipeline Integration
The fine-tuned model acts as the primary generation engine. However, to extend its vocabulary beyond the training data and reduce hallucinations, a local vector database was introduced.
- Data Processing: Dictionary data was scraped using Selenium & BeautifulSoup, then cleaned and normalized with LLM assistance (Claude) before being embedded using all-MiniLM-L6-v2.
- Vector Search: The embeddings are stored in ChromaDB. When a user queries a word, the system retrieves relevant definitions to augment the prompt for the fine-tuned SLM.
- Orchestration & Serving: LangChain orchestrates the retrieval and generation phases. Inference is handled by Ollama. The entire backend is exposed via FastAPI, with a UI built in Streamlit.
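The retrieve-then-augment flow described above can be sketched without any external services. In this toy version, character-bigram counts stand in for the all-MiniLM-L6-v2 embeddings and a plain dictionary stands in for ChromaDB; the entries, function names, and prompt wording are all assumptions made for illustration, not the project's actual data or prompt.

```python
import math
from collections import Counter

# Illustrative entries only; the project's scraped dictionary is not shown in the source.
DICTIONARY = {
    "danum": "water",
    "bale": "house",
    "mangan": "to eat",
}


def embed(text: str) -> Counter:
    """Toy embedding: character bigram counts (stand-in for a sentence-embedding model)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k entries whose headwords are most similar to the query."""
    q = embed(query)
    ranked = sorted(DICTIONARY.items(), key=lambda item: cosine(q, embed(item[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Prepend the retrieved definitions to the translation request."""
    context = "\n".join(f"- {word}: {meaning}" for word, meaning in retrieve(query, k))
    return (
        "Use the dictionary entries below when translating.\n"
        f"{context}\n"
        f"Translate from Kapampangan to English: {query}"
    )


prompt = build_prompt("danum")
print(prompt)
```

In the described system, the resulting prompt would be passed to the fine-tuned model for generation; that step is omitted here because it depends on the local model setup.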
Tech Stack
- Python
- FastAPI & Uvicorn
- Streamlit
- LangChain & Ollama
- ChromaDB
- Unsloth & QLoRA
- Qwen3-1.7B
- Selenium & BeautifulSoup
Pipeline Highlights
- 4-bit GGUF Quantization
- Custom Fine-tuned Translator
- RAG-augmented definition retrieval
- Microservice Architecture
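As a rough back-of-envelope check on the 4-bit quantization highlight, the sketch below estimates the memory needed for the model weights alone at 16-bit and 4-bit precision. The parameter count is the nominal figure implied by the model name, and the helper name is hypothetical. The estimate is a lower bound: it ignores quantization metadata, activations, adapter weights, optimizer state, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 1.7e9  # assumption: nominal size implied by the name "Qwen3-1.7B"

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)
int4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.2f} GB, 4-bit weights: ~{int4_gb:.2f} GB")
```

Under these assumptions the weights shrink from roughly 3.4 GB to roughly 0.85 GB, a fourfold reduction.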