Skip to main content

Grobid

This page covers how to use the Grobid to parse articles for LangChain. It is separated into two parts: installation and running the server

Installation and Setup

#Ensure You have Java installed !apt-get install -y openjdk-11-jdk -q !update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java

#Clone and install the Grobid Repo import os !git clone https://github.com/kermitt2/grobid.git os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64" os.chdir('grobid') !./gradlew clean install

#Run the server, get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')

You can now use the GrobidParser to produce documents

from langchain.document_loaders.parsers import GrobidParser
from langchain.document_loaders.generic import GenericLoader

#Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=False)
)
docs = loader.load()

#Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=True)
)
docs = loader.load()

API Reference:

Chunk metadata will include bboxes although these are a bit funky to parse, see https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/