Use LangChain, GPT and Activeloop's Deep Lake to work with a code base
In this tutorial, we use LangChain and Activeloop's Deep Lake with GPT to analyze the code base of LangChain itself.
- Prepare data:
  - Load all Python project files using TextLoader. We will call these files the documents.
  - Split all documents into chunks using CharacterTextSplitter.
  - Embed the chunks and upload them to Deep Lake using OpenAIEmbeddings and DeepLake.
- Question-Answering:
  - Build a chain from ChatOpenAI and ConversationalRetrievalChain.
  - Prepare questions.
  - Get answers by running the chain.
Integration preparations
We need to set up keys for external services and install necessary python libraries.
#!python3 -m pip install --upgrade langchain deeplake openai
Set up OpenAI embeddings and the Deep Lake multi-modal vector store API, then authenticate.
For full details, see the Deep Lake documentation and API reference.
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass()
# Please manually enter OpenAI Key
Authenticate into Deep Lake if you want to create your own dataset and publish it. You can get an API key from the Activeloop platform.
activeloop_token = getpass("Activeloop Token:")
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
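The two cells above prompt for a key every time they run. As a minimal sketch, assuming you would rather reuse a key that is already exported in your shell, the helper below prompts only when the variable is missing. The name `ensure_secret` and its `reader` parameter are hypothetical and not part of LangChain or Deep Lake.

```python
import os
from getpass import getpass


def ensure_secret(name, prompt=None, reader=getpass):
    """Return os.environ[name], prompting for it only when it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        # reader defaults to getpass, so the key is not echoed to the terminal
        value = reader(prompt or f"{name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_secret("OPENAI_API_KEY")` and `ensure_secret("ACTIVELOOP_TOKEN", "Activeloop Token:")` would then replace the two `getpass` cells.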
Prepare data
Load all repository files. Here we assume this notebook was downloaded as part of a LangChain fork, so we work with the Python files of the langchain repository.
If you want to use files from a different repo, change root_dir to the root directory of your repo.
ls "../../../.."
from langchain.document_loaders import TextLoader

root_dir = "../../../.."

docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        if file.endswith(".py") and "/.venv/" not in dirpath:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding="utf-8")
                docs.extend(loader.load())
            except Exception as e:
                # Skip files that cannot be read or decoded as UTF-8
                print(f"Skipping {file}: {e}")
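The substring test on dirpath depends on "/" path separators, and os.walk still descends into the virtual environment. As a rough alternative, assuming only the standard library, the sketch below prunes unwanted directories before they are visited. The function name `find_python_files` and the `skip_dirs` parameter are illustrative and not part of LangChain.

```python
import os


def find_python_files(root_dir, skip_dirs=(".venv",)):
    """Return sorted paths of all .py files under root_dir, skipping skip_dirs."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Editing dirnames in place stops os.walk from descending into them
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if name.endswith(".py"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Each returned path could then be passed to TextLoader in the same way as in the loop above.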
Then, chunk the files.
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
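To give a feel for what chunk_size controls, here is a simplified pure-Python sketch of separator-based chunking with no overlap. It is an illustration under stated assumptions (a blank-line separator and greedy packing), not the library's actual implementation, and `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow: close the current chunk and start a new one
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because pieces are never cut in the middle, a chunk can exceed chunk_size when one piece is longer than the limit.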
Then embed the chunks and upload them to Deep Lake.
This can take several minutes.
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
from langchain.vectorstores import DeepLake
# Replace <org_id> with your Activeloop organization ID
db = DeepLake.from_documents(
    texts, embeddings, dataset_path=f"hub://{<org_id>}/langchain-code"
)
You can also use Deep Lake's Managed Tensor Database as a hosting service and run queries there. To do so, specify the runtime parameter as {'tensor_db': True} when creating the vector store. Queries then run on the Managed Tensor Database rather than on the client side. This does not apply to datasets stored locally or in memory. A vector store that was created outside the Managed Tensor Database can be transferred to it by following the prescribed steps.
# from langchain.vectorstores import DeepLake
# db = DeepLake.from_documents(
#     texts, embeddings, dataset_path=f"hub://{<org_id>}/langchain-code", runtime={"tensor_db": True}
# )
# db
Question Answering
First load the dataset, construct the retriever, then construct the conversational chain.
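db = DeepLake(
    # Assumed arguments: the dataset path and embeddings from the upload step
    dataset_path=f"hub://{<org_id>}/langchain-code",
    read_only=True,
    embedding_function=embeddings,
)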
retriever = db.as_retriever()
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["fetch_k"] = 20
retriever.search_kwargs["maximal_marginal_relevance"] = True
retriever.search_kwargs["k"] = 20
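The settings above combine cosine distance with maximal marginal relevance (MMR). As a rough illustration of the idea, the pure-Python sketch below first keeps the fetch_k candidates closest to the query, then picks k of them one at a time, trading relevance against similarity to what has already been picked. This is not Deep Lake's implementation, and the `lambda_mult` weighting parameter is an assumption of this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=2, fetch_k=3, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k candidates most similar to the query
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    pool = ranked[:fetch_k]
    selected = []
    # Step 2: repeatedly pick the candidate with the best relevance/novelty trade-off
    while pool and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


candidates = [[0.9, 0.1], [0.9, 0.12], [0.8, -0.6], [0.0, 1.0]]
print(mmr_select([1.0, 0.0], candidates, k=2, fetch_k=3))  # [0, 2]
```

Candidate 1 is nearly a duplicate of candidate 0, so the second pick is the less similar but more novel candidate 2. In this sketch, setting k equal to fetch_k only changes the order of the results, since every fetched candidate is eventually selected.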
You can also specify user-defined functions using Deep Lake filters.
def filter(x):
    # filter based on source code
    if "something" in x["text"].data()["value"]:
        return False

    # filter based on path e.g. extension
    metadata = x["metadata"].data()["value"]
    return "only_this" in metadata["source"] or "also_that" in metadata["source"]
### turn on below for custom filtering
# retriever.search_kwargs['filter'] = filter
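Because the filter above reads values through Deep Lake sample objects, it is awkward to test on its own. One option, sketched here with hypothetical names (`keep_document`, `banned_substring`, `allowed_path_parts`), is to keep the decision logic in a plain function that takes the text and the metadata dictionary directly.

```python
def keep_document(
    text,
    metadata,
    banned_substring="something",
    allowed_path_parts=("only_this", "also_that"),
):
    """Return True if a chunk should be kept by the retriever filter."""
    # Reject chunks whose source code contains the banned substring
    if banned_substring in text:
        return False
    # Keep only chunks whose source path contains one of the allowed parts
    source = metadata.get("source", "")
    return any(part in source for part in allowed_path_parts)
```

A filter in the style shown above could then unwrap the sample and delegate: `keep_document(x["text"].data()["value"], x["metadata"].data()["value"])`.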
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
model = ChatOpenAI(model_name="gpt-3.5-turbo")  # or another chat model such as "gpt-4"
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
questions = [
    "What is the class hierarchy?",
    # "What classes are derived from the Chain class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    # "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
-> Question: What is the class hierarchy?
Answer: There are several class hierarchies in the provided code, so I'll list a few:
is a subclass of BaseModel
: All of these classes are subclasses of BasePromptTemplate
: All of these classes are subclasses of Chain
is a subclass of ABC
: All of these classes are subclasses of BaseTracer
: All of these classes are subclasses of BaseLLM
-> Question: What classes are derived from the Chain class?
Answer: There are multiple classes that are derived from the Chain class. Some of them are:
- APIChain
- AnalyzeDocumentChain
- ChatVectorDBChain
- CombineDocumentsChain
- ConstitutionalChain
- ConversationChain
- GraphQAChain
- HypotheticalDocumentEmbedder
- LLMChain
- LLMCheckerChain
- LLMRequestsChain
- LLMSummarizationCheckerChain
- MapReduceChain
- OpenAPIEndpointChain
- PALChain
- QAWithSourcesChain
- RetrievalQA
- RetrievalQAWithSourcesChain
- SequentialChain
- SQLDatabaseChain
- TransformChain
- VectorDBQA
- VectorDBQAWithSourcesChain
There might be more classes that are derived from the Chain class as it is possible to create custom classes that extend the Chain class.
-> Question: What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?
Answer: All classes and functions in the ./langchain/utilities/ folder seem to have unit tests written for them.
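The question loop used earlier passes the growing chat_history back into the chain on every call, so later questions can build on earlier answers. The sketch below isolates that pattern with a stand-in chain, so it can be run without API keys. `run_questions` and `fake_qa` are hypothetical names, and the stub only mimics the input and output shape of the chain call used in this tutorial.

```python
def run_questions(qa, questions):
    """Ask each question in order, feeding earlier (question, answer) pairs back in."""
    chat_history = []
    for question in questions:
        result = qa({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history


def fake_qa(inputs):
    # Hypothetical stand-in for the real chain: reports how much history it received
    return {"answer": f"seen {len(inputs['chat_history'])} earlier turns"}


for question, answer in run_questions(fake_qa, ["first", "second"]):
    print(f"{question}: {answer}")
```

Swapping `fake_qa` for the `qa` chain built above should give the same loop behaviour against the real model.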