Grobid

GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.

It is particularly good for structured PDFs, such as academic papers.

This loader uses GROBID to parse PDFs into Documents that retain metadata about the section of text each one came from.

For users on Mac:

(Note: additional installation instructions can be found in the GROBID documentation.)

Install Java (Apple Silicon):

$ arch -arm64 brew install openjdk@11
$ brew --prefix openjdk@11
/opt/homebrew/opt/openjdk@11

In ~/.zshrc:

export JAVA_HOME=/opt/homebrew/opt/openjdk@11
export PATH=$JAVA_HOME/bin:$PATH

Then, in Terminal:

$ source ~/.zshrc

Confirm install:

$ which java
/opt/homebrew/opt/openjdk@11/bin/java
$ java -version
openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment Homebrew (build 11.0.19+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.19+0, mixed mode)

Then, get Grobid:

$ curl -LO https://github.com/kermitt2/grobid/archive/0.7.3.zip
$ unzip 0.7.3.zip

Build (from inside the unzipped directory):

$ cd grobid-0.7.3
$ ./gradlew clean install

Then, run the server in the background (this command is meant for a notebook cell):

get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
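Because the server starts in the background, it can help to confirm it is reachable before loading any PDFs. The following is a minimal sketch, not part of the loader itself: it assumes GROBID is listening on its default port 8070 and polls the service's `/api/isalive` endpoint. The helper name `grobid_is_alive` and the `base_url` and `timeout` parameters are hypothetical, chosen for illustration.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if a GROBID server responds at base_url (assumed default port 8070)."""
    try:
        # /api/isalive is GROBID's health-check endpoint.
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed URL: treat as "not running".
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```

If this reports that the server is not reachable, `grobid.log` is the first place to look, since the run command above redirects the server output there.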

Now, we can use the data loader.

from langchain.document_loaders.parsers import GrobidParser
from langchain.document_loaders.generic import GenericLoader

loader = GenericLoader.from_filesystem(
    "../Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()

docs[3].page_content

    'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g."Books -2TB" or "Social media conversations").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.'

docs[3].metadata

    {'text': 'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g."Books -2TB" or "Social media conversations").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.',
     'para': '2',
     'bboxes': "[[{'page': '1', 'x': '317.05', 'y': '509.17', 'h': '207.73', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '522.72', 'h': '220.08', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '536.27', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '549.82', 'h': '218.65', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '563.37', 'h': '136.98', 'w': '9.46'}], [{'page': '1', 'x': '446.49', 'y': '563.37', 'h': '78.11', 'w': '9.46'}, {'page': '1', 'x': '304.69', 'y': '576.92', 'h': '138.32', 'w': '9.46'}], [{'page': '1', 'x': '447.75', 'y': '576.92', 'h': '76.66', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '590.47', 'h': '219.63', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '604.02', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '617.56', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '631.11', 'h': '220.18', 'w': '9.46'}]]",
     'pages': "('1', '1')",
     'section_title': 'Introduction',
     'section_number': '1',
     'paper_title': 'LLaMA: Open and Efficient Foundation Language Models',
     'file_path': '/Users/31treehaus/Desktop/Papers/2302.13971.pdf'}
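In the output above, every metadata value is a string, including `bboxes` and `pages`, which hold Python-literal text. A small sketch of decoding them into numeric values follows, assuming the metadata keeps the shape shown here. The helper `parse_grobid_metadata` is hypothetical and the sample dictionary is abbreviated from the output above.

```python
import ast


def parse_grobid_metadata(metadata):
    """Return a copy of the metadata with 'bboxes' and 'pages' decoded to numbers."""
    parsed = dict(metadata)
    # 'bboxes' is a string containing nested lists of dicts with string values.
    groups = ast.literal_eval(metadata["bboxes"])
    parsed["bboxes"] = [
        [
            {"page": int(box["page"]), **{k: float(box[k]) for k in ("x", "y", "h", "w")}}
            for box in group
        ]
        for group in groups
    ]
    # 'pages' is a string such as "('1', '1')": first and last page of the chunk.
    first, last = ast.literal_eval(metadata["pages"])
    parsed["pages"] = (int(first), int(last))
    return parsed


# Abbreviated sample in the same shape as docs[3].metadata.
sample = {
    "para": "2",
    "bboxes": "[[{'page': '1', 'x': '317.05', 'y': '509.17', 'h': '207.73', 'w': '9.46'}]]",
    "pages": "('1', '1')",
    "section_title": "Introduction",
}
parsed = parse_grobid_metadata(sample)
print(parsed["pages"], parsed["bboxes"][0][0])
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals, so it is safe to apply to strings that originate from parsed documents.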