Vector Databases with LangChain

This post is part of a series. If you are new to LangChain, I recommend reading this one first.

Retrieving huge amounts of data effectively is crucial in today’s world. Traditional databases and retrieval methods often fall short when dealing with unstructured or semi-structured data, such as text, images, and embeddings. This is where vector databases and LangChain retrievers come into play, offering powerful solutions to modern data retrieval challenges.

Understanding Vector Databases

A vector database is designed to store and query high-dimensional vectors, which represent data points in a mathematical space. Unlike traditional databases that work well with structured data (like SQL databases), vector databases excel in handling unstructured data by leveraging vector embeddings.

What are vector embeddings?

Vector embeddings are a way to convert words, sentences, and other data into numbers that capture their meaning and relationships. Because items with similar meaning end up close together in that numeric space, retrieval can match on meaning rather than on exact keywords. I liked the examples in this article about creating vector embeddings: https://www.pinecone.io/learn/vector-embeddings/
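To make "numbers that capture meaning" a little more concrete, here is a minimal sketch that compares a few toy vectors with cosine similarity. The three vectors are hand-picked for illustration and are an assumption of this example; they are not produced by any real embedding model, and real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors (hypothetical, NOT the output of a real model).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

cat_vs_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_vs_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# The related pair scores higher than the unrelated pair.
print(f"cat vs kitten: {cat_vs_kitten:.3f}")
print(f"cat vs car:    {cat_vs_car:.3f}")
```

With a real embedding model the vectors would come from the model rather than being typed in by hand, but the comparison step would look much the same.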

Key Features of Vector Databases:

  1. High-Dimensional Storage: Capable of storing vectors with hundreds or thousands of dimensions.
  2. Similarity Search: Efficiently find the vectors most similar to a given query vector, using metrics such as Euclidean distance or cosine similarity.
  3. Scalability: Handle large volumes of data and perform real-time searches.
  4. Versatility: Useful in various applications, including natural language processing (NLP), image recognition, recommendation systems, and more.
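The similarity search described in the list above can be sketched as a brute-force nearest-neighbor lookup. This is only an illustration under simple assumptions: the function names, the document ids, and the 2-dimensional vectors are made up for the example, and the search scans every stored vector. Real vector databases generally rely on index structures so they do not have to compare the query against everything.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, stored, k=2):
    """Return the k (doc_id, distance) pairs closest to query, nearest first."""
    scored = (
        (euclidean_distance(query, vector), doc_id)
        for doc_id, vector in stored.items()
    )
    return [(doc_id, dist) for dist, doc_id in heapq.nsmallest(k, scored)]


# Hypothetical stored vectors keyed by a made-up document id.
stored_vectors = {
    "doc-a": [0.0, 0.0],
    "doc-b": [1.0, 1.0],
    "doc-c": [5.0, 5.0],
}

results = nearest_neighbors([0.9, 1.2], stored_vectors, k=2)
for doc_id, dist in results:
    print(doc_id, round(dist, 3))
```

Swapping `euclidean_distance` for a cosine-based score changes the ranking criterion without changing the overall shape of the search.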

One of the critical components of LangChain is its retriever module, which leverages vector databases to enhance information retrieval.

Key Features of LangChain Retrievers:

  1. Integration with Vector Databases: Seamlessly connect with various vector databases to store and retrieve embeddings.
  2. Customizable Pipelines: Build retrieval pipelines tailored to specific use cases and requirements.
  3. Efficiency: Optimize retrieval processes for performance and accuracy.

I have created a sample program that walks through loading, transforming, embedding, and querying:

  1. Load: The data is ingested from a PDF using LangChain’s third-party PDF loader (PyPDFLoader).
  2. Transform: The loaded text is split into chunks using LangChain’s recursive text splitter (RecursiveCharacterTextSplitter).
  3. Embed: The chunks are embedded with OllamaEmbeddings and stored in a vector DB (FAISS).
  4. Query: Once the chunks are stored, I query the DB using LangChain’s similarity_search.

Figure: LangChain VectorStore flow
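Before the full program, here is a small sketch of what the Transform step's chunk size and chunk overlap mean. It is a simplified stand-in written for this post, not LangChain's implementation: RecursiveCharacterTextSplitter tries to break on separators such as paragraph and line breaks, whereas the hypothetical `split_with_overlap` helper below just slides a fixed-size character window.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last few characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. That is the reason for using a non-zero overlap when splitting documents for retrieval.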

Code

I am using the PDF at this link, saved locally as type1Diabetes.pdf (the filename the code loads): https://main.icmr.nic.in/sites/default/files/upload_documents/ICMR_Guidelines_for_Management_of_Type_1_Diabetes.pdf

import os

# Work around the duplicate OpenMP runtime error that FAISS can trigger on some setups
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# LangSmith settings; they are only read when set as environment variables
os.environ["LANGCHAIN_API_KEY"] = "YOUR_KEY"
os.environ["LANGCHAIN_PROJECT"] = "LANGCHAIN_PORTFOLIO"

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load: read the PDF into one Document per page
loader = PyPDFLoader("type1Diabetes.pdf")
docs = loader.load()

# Transform: split the pages into overlapping chunks of up to 1000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(docs)

# Embed: compute embeddings with the llama3 model served by Ollama and index them in FAISS
db = FAISS.from_documents(documents, OllamaEmbeddings(model="llama3"))

# Query: fetch the chunks most similar to the question and print the best match
query = "What are the Aims of nutritional management?"
retrieved_results = db.similarity_search(query)
print(retrieved_results[0].page_content)

Output

Lifestyle-Diet and ExerciseChapter-2
Dr. Vijayasarathi HA, Dr. C.S. Yajnik
Introduction
Lifestyle management (LSM) plays an essential role in managing type 1 diabetes mellitus
(T1DM). Understanding the effect of diet and physical activity on glycemia is essential for optimal management of T1DM.
Aims of Nutritional Management
1) Maintain glycemia in the normal to the near-normal range with minimal/no hypoglycemia.
2)    Maintain optimal blood pressure, weight, and lipid levels.
3) Ensure adequate nutrition to facilitate healthy growth and development in children and
adolescents.
4) T o prevent the development or progression of diabetes-related microvascular and macrovascular complications.
5) Address individual nutrition needs, incorporating personal, social, and cultural
preferences.
6)    Improve overall health through appropriate food choices.
Concepts of energy and proximate principles of diet 

Conclusion

We can build highly efficient and accurate retrieval systems by combining the strengths of vector databases and LangChain retrievers. Whether you’re working with text, images, or other unstructured data types, this powerful combination offers a scalable and versatile solution for modern data retrieval challenges.

As AI and machine learning evolve, tools like vector databases and frameworks like LangChain will become increasingly essential, enabling developers to unlock new possibilities and drive innovation.