Use LangChain, GPT and Activeloop's Deep Lake to work with code base
In this tutorial, we use LangChain + Activeloop's Deep Lake with GPT to analyze the code base of LangChain itself.
Design
- Prepare data:
  - Upload all Python project files using langchain.document_loaders.TextLoader. We will call these files the documents.
  - Split all documents into chunks using langchain.text_splitter.CharacterTextSplitter.
  - Embed the chunks and upload them into Deep Lake using langchain.embeddings.openai.OpenAIEmbeddings and langchain.vectorstores.DeepLake.
- Question-Answering:
  - Build a chain from langchain.chat_models.ChatOpenAI and langchain.chains.ConversationalRetrievalChain.
  - Prepare questions.
  - Get answers by running the chain.
Implementation
Integration preparations
We need to set up keys for external services and install the necessary Python libraries.
!python3 -m pip install --upgrade langchain deeplake openai
Set up OpenAI embeddings and the Deep Lake multi-modal vector store API, and authenticate.
For the full Deep Lake documentation, see https://docs.activeloop.ai/ and the API reference at https://docs.deeplake.ai/en/latest/
import os
from getpass import getpass

# Please manually enter your OpenAI key
os.environ["OPENAI_API_KEY"] = getpass()
Authenticate into Deep Lake if you want to create your own dataset and publish it. You can get an API key from the platform at app.activeloop.ai.
activeloop_token = getpass("Activeloop Token:")
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
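Before moving on, you can optionally verify that both credentials were picked up. This quick check is not part of the original flow, just a fail-fast sketch:

# Optional sanity check: fail fast if either credential is missing.
for key in ("OPENAI_API_KEY", "ACTIVELOOP_TOKEN"):
    assert os.environ.get(key), f"{key} is not set"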
Prepare data
Load all repository files. Here we assume this notebook is downloaded as part of the LangChain fork, and we work with the Python files of the langchain repo.
If you want to use files from a different repo, change root_dir to the root directory of your repo.
ls "../../../.."
from langchain.document_loaders import TextLoader

root_dir = "../../../.."

docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        # Skip virtualenv files; load every other Python file as a document
        if file.endswith(".py") and "/.venv/" not in dirpath:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding="utf-8")
                docs.extend(loader.load_and_split())
            except Exception:
                pass
print(f"{len(docs)}")
API Reference:
- TextLoader from langchain.document_loaders
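If you want to see what TextLoader produces before processing the whole repo, you can load a single file first. The path below is a placeholder; point it at any .py file in your checkout:

# Illustrative only: inspect one loaded document (path is hypothetical).
sample_docs = TextLoader("langchain/chains/base.py", encoding="utf-8").load()
print(sample_docs[0].metadata)            # e.g. {'source': 'langchain/chains/base.py'}
print(sample_docs[0].page_content[:200])  # first 200 characters of the file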
Then, chunk the files.
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)}")
API Reference:
- CharacterTextSplitter from langchain.text_splitter
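Each resulting chunk is itself a Document whose page_content is capped at roughly chunk_size characters and whose metadata still points at the source file. A quick inspection sketch:

# Optional: inspect the first chunk to confirm the split.
print(texts[0].metadata["source"])  # path of the file this chunk came from
print(len(texts[0].page_content))   # typically close to the 1000-character chunk_size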
Then embed the chunks and upload them to Deep Lake.
This can take several minutes.
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
embeddings
API Reference:
- OpenAIEmbeddings from langchain.embeddings.openai
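You can embed a single string first to confirm the key works and to see the vector size. A small sketch; the query text is arbitrary:

# Optional: embed one query string and check the vector dimensionality.
vector = embeddings.embed_query("What does a Chain do?")
print(len(vector))  # 1536 for the default OpenAI embedding model at the time of writing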
from langchain.vectorstores import DeepLake

db = DeepLake.from_documents(
    texts, embeddings, dataset_path="hub://<org_id>/langchain-code"  # replace <org_id> with your Activeloop org id
)
db
API Reference:
- DeepLake from langchain.vectorstores
Optional: You can also use Deep Lake's Managed Tensor Database as a hosting service and run queries there. To do so, specify the runtime parameter as {'tensor_db': True} when creating the vector store. This configuration runs queries on the Managed Tensor Database rather than on the client side. Note that this functionality does not apply to datasets stored locally or in-memory. If a vector store has already been created outside of the Managed Tensor Database, you can transfer it to the Managed Tensor Database by following the prescribed steps.
# from langchain.vectorstores import DeepLake
# db = DeepLake.from_documents(
#     texts, embeddings, dataset_path="hub://<org_id>/langchain-code", runtime={"tensor_db": True}
# )
# db
API Reference:
- DeepLake from langchain.vectorstores
Question Answering
First load the dataset, construct the retriever, then construct the Conversational Chain.
db = DeepLake(
    dataset_path="hub://<org_id>/langchain-code",  # replace <org_id> with your Activeloop org id
    read_only=True,
    embedding_function=embeddings,
)
retriever = db.as_retriever()
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["fetch_k"] = 20
retriever.search_kwargs["maximal_marginal_relevance"] = True
retriever.search_kwargs["k"] = 20
You can also specify user-defined functions using Deep Lake filters.
def filter(x):
    # filter based on source code
    if "something" in x["text"].data()["value"]:
        return False

    # filter based on path e.g. extension
    metadata = x["metadata"].data()["value"]
    return "only_this" in metadata["source"] or "also_that" in metadata["source"]

### turn on below for custom filtering
# retriever.search_kwargs["filter"] = filter
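For example, a more concrete filter that restricts retrieval to files under langchain/chains/ could look like the sketch below. It follows the same access pattern as the function above; the path fragment is an assumption you would adapt:

# Illustrative variant: keep only chunks whose source path is under langchain/chains/.
def chains_only(x):
    return "langchain/chains/" in x["metadata"].data()["value"]["source"]

# retriever.search_kwargs["filter"] = chains_only  # enable when needed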
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
model = ChatOpenAI(model_name="gpt-3.5-turbo")  # switch to e.g. "gpt-4" for higher-quality answers
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
API Reference:
- ChatOpenAI from langchain.chat_models
- ConversationalRetrievalChain from langchain.chains
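If you also want to see which files each answer was grounded in, ConversationalRetrievalChain.from_llm accepts a return_source_documents flag. A minimal sketch, assuming the model and retriever built above:

# Sketch: a variant of the chain whose results also carry the retrieved chunks.
qa_with_sources = ConversationalRetrievalChain.from_llm(
    model, retriever=retriever, return_source_documents=True
)
result = qa_with_sources({"question": "What is the class hierarchy?", "chat_history": []})
for doc in result["source_documents"][:3]:
    print(doc.metadata["source"])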
questions = [
    "What is the class hierarchy?",
    # "What classes are derived from the Chain class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    # "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
-> **Question**: What is the class hierarchy?

**Answer**: There are several class hierarchies in the provided code, so I'll list a few:

1. BaseModel -> ConstitutionalPrinciple: ConstitutionalPrinciple is a subclass of BaseModel.
2. BasePromptTemplate -> StringPromptTemplate, AIMessagePromptTemplate, BaseChatPromptTemplate, ChatMessagePromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate, MessagesPlaceholder, SystemMessagePromptTemplate, FewShotPromptTemplate, FewShotPromptWithTemplates, Prompt, PromptTemplate: All of these classes are subclasses of BasePromptTemplate.
3. APIChain, Chain, MapReduceDocumentsChain, MapRerankDocumentsChain, RefineDocumentsChain, StuffDocumentsChain, HypotheticalDocumentEmbedder, LLMChain, LLMBashChain, LLMCheckerChain, LLMMathChain, LLMRequestsChain, PALChain, QAWithSourcesChain, VectorDBQAWithSourcesChain, VectorDBQA, SQLDatabaseChain: All of these classes are subclasses of Chain.
4. BaseLoader: BaseLoader is a subclass of ABC.
5. BaseTracer -> ChainRun, LLMRun, SharedTracer, ToolRun, Tracer, TracerException, TracerSession: All of these classes are subclasses of BaseTracer.
6. OpenAIEmbeddings, HuggingFaceEmbeddings, CohereEmbeddings, JinaEmbeddings, LlamaCppEmbeddings, HuggingFaceHubEmbeddings, TensorflowHubEmbeddings, SagemakerEndpointEmbeddings, HuggingFaceInstructEmbeddings, SelfHostedEmbeddings, SelfHostedHuggingFaceEmbeddings, SelfHostedHuggingFaceInstructEmbeddings, FakeEmbeddings, AlephAlphaAsymmetricSemanticEmbedding, AlephAlphaSymmetricSemanticEmbedding: All of these classes are subclasses of BaseLLM.
-> **Question**: What classes are derived from the Chain class?

**Answer**: There are multiple classes that are derived from the Chain class. Some of them are:
- APIChain
- AnalyzeDocumentChain
- ChatVectorDBChain
- CombineDocumentsChain
- ConstitutionalChain
- ConversationChain
- GraphQAChain
- HypotheticalDocumentEmbedder
- LLMChain
- LLMCheckerChain
- LLMRequestsChain
- LLMSummarizationCheckerChain
- MapReduceChain
- OpenAPIEndpointChain
- PALChain
- QAWithSourcesChain
- RetrievalQA
- RetrievalQAWithSourcesChain
- SequentialChain
- SQLDatabaseChain
- TransformChain
- VectorDBQA
- VectorDBQAWithSourcesChain
There might be more classes that are derived from the Chain class as it is possible to create custom classes that extend the Chain class.
-> **Question**: What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?

**Answer**: All classes and functions in the ./langchain/utilities/ folder seem to have unit tests written for them.