Document loaders
Etherscan Loader
Overview
acreom
acreom is a dev-first knowledge base with tasks running on local markdown files.
Airbyte JSON
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
Airtable
* Get your API key here.
Alibaba Cloud MaxCompute
Alibaba Cloud MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.
Apify Dataset
Apify Dataset is a scalable append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, and then exporting them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of Apify Actors, which are serverless cloud programs for various web scraping, crawling, and data extraction use cases.
Arxiv
arXiv is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
AsyncHtmlLoader
AsyncHtmlLoader loads raw HTML from a list of URLs concurrently.
AWS S3 Directory
Amazon Simple Storage Service (Amazon S3) is an object storage service.
AWS S3 File
Amazon Simple Storage Service (Amazon S3) is an object storage service.
AZLyrics
AZLyrics is a large, legal collection of lyrics that grows every day.
Azure Blob Storage Container
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.
Azure Blob Storage File
Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol, Network File System (NFS) protocol, and Azure Files REST API.
BibTeX
BibTeX is a file format and reference management system commonly used in conjunction with LaTeX typesetting. It serves as a way to organize and store bibliographic information for academic and research documents.
BiliBili
Bilibili is one of the most beloved long-form video sites in China.
Blackboard
Blackboard Learn (previously the Blackboard Learning Management System) is a web-based virtual learning environment and learning management system developed by Blackboard Inc. The software features course management, customizable open architecture, and scalable design that allows integration with student information systems and authentication protocols. It may be installed on local servers, hosted by Blackboard ASP Solutions, or provided as Software as a Service hosted on Amazon Web Services. Its main purposes are stated to include the addition of online elements to courses traditionally delivered face-to-face and development of completely online courses with few or no face-to-face meetings.
Blockchain
Overview
Brave Search
Brave Search is a search engine developed by Brave Software.
Browserless
Browserless is a service that allows you to run headless Chrome instances in the cloud. It's a great way to run browser-based automation at scale without having to worry about managing your own infrastructure.
ChatGPT Data
ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI.
College Confidential
College Confidential gives information on 3,800+ colleges and universities.
Confluence
Confluence is a wiki collaboration platform that saves and organizes all of the project-related material. Confluence is a knowledge base that primarily handles content management activities.
CoNLL-U
CoNLL-U is a revised version of the CoNLL-X format. Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:
Copy Paste
This notebook covers how to load a document object from something you just want to copy and paste. In this case, you don't even need to use a DocumentLoader, but rather can just construct the Document directly.
CSV
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
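The record-and-field structure described above can be sketched with Python's standard `csv` module. This is only an illustration, not the loader's implementation: the helper names `rows_to_records` and `record_to_text`, the sample data, and the choice to render each record as `key: value` lines are all assumptions made for the example.

```python
import csv
import io


def rows_to_records(text: str) -> list[dict]:
    """Parse CSV text into one dict per data record, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


def record_to_text(record: dict) -> str:
    """Flatten one record into 'key: value' lines (an assumed rendering)."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


# Hypothetical sample data: a header line followed by two records.
SAMPLE = "team,wins\nNationals,81\nReds,82\n"

records = rows_to_records(SAMPLE)
for record in records:
    print(record_to_text(record))
    print("---")
```

Each line after the header becomes one record, and each comma-separated value becomes one field of that record.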
Cube Semantic Layer
This notebook demonstrates the process of retrieving Cube's data model metadata in a format suitable for passing to LLMs as embeddings, thereby enhancing contextual information.
Datadog Logs
Datadog is a monitoring and analytics platform for cloud-scale applications.
Diffbot
Unlike traditional web scraping tools, Diffbot doesn't require any rules to read the content on a page.
Discord
Discord is a VoIP and instant messaging social platform. Users have the ability to communicate with voice calls, video calls, text messaging, media and files in private chats or as part of communities called "servers". A server is a collection of persistent chat rooms and voice channels which can be accessed via invite links.
Docugami
This notebook covers how to load documents from Docugami. It also describes the advantages of using this system over alternative data loaders.
DuckDB
DuckDB is an in-process SQL OLAP database management system.
Email
This notebook shows how to load email (.eml) or Microsoft Outlook (.msg) files.
Embaas
embaas is a fully managed NLP API service that offers features like embedding generation, document text extraction, document to embeddings and more. You can choose a variety of pre-trained models.
EPub
EPUB is an e-book file format that uses the ".epub" file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
EverNote
EverNote is intended for archiving and creating notes in which photos, audio and saved web content can be embedded. Notes are stored in virtual "notebooks" and can be tagged, annotated, edited, searched, and exported.
Microsoft Excel
The UnstructuredExcelLoader is used to load Microsoft Excel files. The loader works with both .xlsx and .xls files. The page content will be the raw text of the Excel file. If you use the loader in "elements" mode, an HTML representation of the Excel file will be available in the document metadata under the text_as_html key.
Facebook Chat
Messenger is an American proprietary instant messaging app and platform developed by Meta Platforms. Originally developed as Facebook Chat in 2008, the company revamped its messaging service in 2010.
Fauna
Fauna is a Document Database.
Figma
Figma is a collaborative web application for interface design.
Geopandas
Geopandas is an open source project to make working with geospatial data in Python easier.
Git
Git is a distributed version control system that tracks changes in any set of computer files, usually used for coordinating work among programmers collaboratively developing source code during software development.
GitBook
GitBook is a modern documentation platform where teams can document everything from products to internal knowledge bases and APIs.
GitHub
This notebook shows how you can load issues and pull requests (PRs) for a given repository on GitHub. We will use the LangChain Python repository as an example.
Google BigQuery
Google BigQuery is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data.
Google Cloud Storage Directory
Google Cloud Storage is a managed service for storing unstructured data.
Google Cloud Storage File
Google Cloud Storage is a managed service for storing unstructured data.
Google Drive
Google Drive is a file storage and synchronization service developed by Google.
Grobid
GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.
Gutenberg
Project Gutenberg is an online library of free eBooks.
Hacker News
Hacker News (sometimes abbreviated as HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity."
HuggingFace dataset
The Hugging Face Hub is home to over 5,000 datasets in more than 100 languages that can be used for a broad range of tasks across NLP, Computer Vision, and Audio. They are used for a diverse range of tasks such as translation,
iFixit
iFixit is the largest, open repair community on the web. The site contains nearly 100k repair manuals, 200k Questions & Answers on 42k devices, and all the data is licensed under CC-BY-NC-SA 3.0.
Images
This covers how to load images such as JPG or PNG into a document format that we can use downstream.
Image captions
By default, the loader utilizes the pre-trained Salesforce BLIP image captioning model.
IMSDb
IMSDb is the Internet Movie Script Database.
Iugu
Iugu is a Brazilian services and software as a service (SaaS) company. It offers payment-processing software and application programming interfaces for e-commerce websites and mobile applications.
Joplin
Joplin is an open source note-taking app. Capture your thoughts and securely access them from any device.
Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive computational environment for creating notebook documents.
LarkSuite (FeiShu)
LarkSuite is an enterprise collaboration platform developed by ByteDance.
Mastodon
Mastodon is a federated social media and social networking service.
MediaWikiDump
MediaWiki XML Dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. An XML dump does not create a full backup of the wiki database; the dump does not contain user accounts, images, edit logs, etc.
MergeDocLoader
Merge the documents returned from a set of specified data loaders.
mhtml
MHTML is used both for emails and for archived webpages. MHTML, sometimes referred to as MHT, stands for MIME HTML and is a single file in which an entire webpage is archived. When a webpage is saved in MHTML format, the file contains HTML code, images, audio files, flash animation, etc.
Microsoft OneDrive
Microsoft OneDrive (formerly SkyDrive) is a file hosting service operated by Microsoft.
Microsoft PowerPoint
Microsoft PowerPoint is a presentation program by Microsoft.
Microsoft Word
Microsoft Word is a word processor developed by Microsoft.
Modern Treasury
Modern Treasury simplifies complex payment operations. It is a unified platform to power products and processes that move money.
Notion DB 1/2
Notion is a collaboration platform with modified Markdown support that integrates kanban boards, tasks, wikis and databases. It is an all-in-one workspace for notetaking, knowledge and data management, and project and task management.
Notion DB 2/2
Notion is a collaboration platform with modified Markdown support that integrates kanban boards, tasks, wikis and databases. It is an all-in-one workspace for notetaking, knowledge and data management, and project and task management.
Obsidian
Obsidian is a powerful and extensible knowledge base.
Open Document Format (ODT)
The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations, and graphics that uses ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.
Open City Data
Socrata provides an API for city open data.
Org-mode
Org Mode is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.
Pandas DataFrame
This notebook goes over how to load data from a pandas DataFrame.
Psychic
This notebook covers how to load documents from Psychic.
PySpark DataFrame Loader
This notebook goes over how to load data from a PySpark DataFrame.
ReadTheDocs Documentation
Read the Docs is an open-sourced free software documentation hosting platform. It generates documentation written with the Sphinx documentation generator.
Recursive URL Loader
We may want to load all URLs under a root directory.
Reddit
Reddit is an American social news aggregation, content rating, and discussion website.
Roam
ROAM is a note-taking tool for networked thought, designed to create a personal knowledge base.
Rockset
Rockset is a real-time analytics database which enables queries on massive, semi-structured data without operational burden. With Rockset, ingested data is queryable within one second and analytical queries against that data typically execute in milliseconds. Rockset is compute optimized, making it suitable for serving high concurrency applications in the sub-100TB range (or larger than 100s of TBs with rollups).
RST
A reStructuredText (RST) file is a file format for textual data used primarily in the Python programming language community for technical documentation.
Sitemap
Extending WebBaseLoader, SitemapLoader loads a sitemap from a given URL, then scrapes and loads all pages in the sitemap, returning each page as a Document.
Slack
Slack is an instant messaging program.
Snowflake
This notebook goes over how to load documents from Snowflake.
Source Code
This notebook covers how to load source code files using a special approach with language parsing: each top-level function and class in the code is loaded into separate documents. Any remaining top-level code outside the already loaded functions and classes will be loaded into a separate document.
Spreedly
Spreedly is a service that allows you to securely store credit cards and use them to transact against any number of payment gateways and third party APIs. It does this by simultaneously providing a card tokenization/vault service as well as a gateway and receiver integration service. Payment methods tokenized by Spreedly are stored at Spreedly, allowing you to independently store a card and then pass that card to different end points based on your business requirements.
Stripe
Stripe is an Irish-American financial services and software as a service (SaaS) company. It offers payment-processing software and application programming interfaces for e-commerce websites and mobile applications.
Subtitle
The SubRip file format is described on the Matroska multimedia container format website as "perhaps the most basic of all subtitle formats." SubRip (SubRip Text) files are named with the extension .srt, and contain formatted lines of plain text in groups separated by a blank line. Subtitles are numbered sequentially, starting at 1. The timecode format used is hours:minutes:seconds,milliseconds with time units fixed to two zero-padded digits and fractions fixed to three zero-padded digits (00:00:00,000). The fractional separator used is the comma, since the program was written in France.
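As a rough illustration of that layout, the sketch below parses SubRip text using only the Python standard library. It assumes timecodes of the form hours:minutes:seconds,milliseconds (for example 00:00:01,000) and a `-->` separator between start and end times. The function names `parse_timecode` and `parse_srt` and the sample cues are hypothetical, and this is not how the loader itself is implemented.

```python
import re
from datetime import timedelta

# Matches one SubRip timecode such as "00:01:02,500".
TIMECODE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def parse_timecode(text: str) -> timedelta:
    """Convert a single SubRip timecode into a timedelta."""
    match = TIMECODE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a SubRip timecode: {text!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return timedelta(
        hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis
    )


def parse_srt(content: str) -> list[dict]:
    """Split SRT text into cues, which are groups separated by a blank line."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # skip groups without an index line and a timing line
        start, end = (parse_timecode(part) for part in lines[1].split("-->"))
        cues.append(
            {
                "index": int(lines[0]),
                "start": start,
                "end": end,
                "text": "\n".join(lines[2:]),
            }
        )
    return cues


# Hypothetical sample with two sequentially numbered subtitles.
SAMPLE = """1
00:00:01,000 --> 00:00:04,250
Hello there.

2
00:00:05,000 --> 00:00:07,500
General Kenobi.
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(cue["index"], cue["start"], cue["end"], cue["text"])
```

The comma in each timecode is treated as the fractional separator, so the three digits after it are read as milliseconds.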
Telegram
Telegram Messenger is a globally accessible freemium, cross-platform, encrypted, cloud-based and centralized instant messaging service. The application also provides optional end-to-end encrypted chats and video calling, VoIP, file sharing and several other features.
Tencent COS Directory
This covers how to load document objects from a Tencent COS Directory.
Tencent COS File
This covers how to load a document object from a Tencent COS File.
2Markdown
2markdown service transforms website content into structured markdown files.
TOML
TOML is a file format for configuration files. It is intended to be easy to read and write, and is designed to map unambiguously to a dictionary. Its specification is open-source. TOML is implemented in many programming languages. The name TOML is an acronym for "Tom's Obvious, Minimal Language" referring to its creator, Tom Preston-Werner.
Trello
Trello is a web-based project management and collaboration tool that allows individuals and teams to organize and track their tasks and projects. It provides a visual interface known as a "board" where users can create lists and cards to represent their tasks and activities.
TSV
A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data. Records are separated by newlines, and values within a record are separated by tab characters.
Twitter
Twitter is an online social media and social networking service.
Unstructured File
This notebook covers how to use the Unstructured package to load files of many types. Unstructured currently supports loading of text files, PowerPoints, HTML, PDFs, images, and more.
URL
This covers how to load HTML documents from a list of URLs into a document format that we can use downstream.
Weather
OpenWeatherMap is an open source weather service provider.
WebBaseLoader
This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages, look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.
WhatsApp Chat
WhatsApp (also called WhatsApp Messenger) is a freeware, cross-platform, centralized instant messaging (IM) and voice-over-IP (VoIP) service. It allows users to send text and voice messages, make voice and video calls, and share images, documents, user locations, and other content.
Wikipedia
Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. Wikipedia is the largest and most-read reference work in history.
XML
The UnstructuredXMLLoader is used to load XML files. The loader works with .xml files. The page content will be the text extracted from the XML tags.
Xorbits Pandas DataFrame
This notebook goes over how to load data from a xorbits.pandas DataFrame.
Loading documents from a YouTube URL
Building chat or QA applications on YouTube videos is a topic of high interest.
YouTube transcripts
YouTube is an online video sharing and social media platform created by Google.