URL
This covers how to load HTML documents from a list of URLs into a document format that we can use downstream.
from langchain.document_loaders import UnstructuredURLLoader
API Reference:
- UnstructuredURLLoader from
langchain.document_loaders
urls = [
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]
Pass in ssl_verify=False with headers=headers to get past ssl_verification error.
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
Selenium URL Loader
This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader
.
Using selenium allows us to load pages that require JavaScript to render.
Setup
To use the SeleniumURLLoader
, you will need to install selenium
and unstructured
.
from langchain.document_loaders import SeleniumURLLoader
API Reference:
- SeleniumURLLoader from
langchain.document_loaders
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = SeleniumURLLoader(urls=urls)
data = loader.load()
Playwright URL Loader
This covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader
.
As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.
Setup
To use the PlaywrightURLLoader
, you will need to install playwright
and unstructured
. Additionally, you will need to install the Playwright Chromium browser:
# Install playwright
pip install "playwright"
pip install "unstructured"
playwright install
from langchain.document_loaders import PlaywrightURLLoader
API Reference:
- PlaywrightURLLoader from
langchain.document_loaders
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
data = loader.load()