DocQuery : Extracting data from documents using a query engine

DocQuery is a library and command-line tool that makes it easy to analyze semi-structured and unstructured documents (PDFs, scanned images, etc.) using large language models (LLMs). You specify a question and point DocQuery at the document or documents you want to query.

Install Package :

pip install docquery
# Tesseract provides OCR for scanned documents and images
apt install tesseract-ocr

With the docquery scan command, you can ask one or more questions about a single document or a directory of files.
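
To make the "single document or directory" behavior concrete, here is a minimal Python sketch of how a path could be expanded into the list of files to query. This is not DocQuery's own implementation: the function name collect_documents and the extension list are assumptions made for illustration, and the file types DocQuery actually accepts may differ.

```python
from pathlib import Path

# Assumed set of extensions for illustration only;
# the file types DocQuery actually supports may differ.
DOC_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_documents(path):
    """Return the document files behind a path (hypothetical helper).

    A file path yields just that file; a directory is searched
    recursively for files with a known document extension.
    """
    p = Path(path)
    if p.is_file():
        return [p]
    if p.is_dir():
        return sorted(
            f for f in p.rglob("*")
            if f.is_file() and f.suffix.lower() in DOC_EXTENSIONS
        )
    raise FileNotFoundError(f"No such file or directory: {path}")
```

Each returned path could then be loaded and queried in turn, so the same set of questions runs against every document in the directory.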

Quickstart :

from docquery import document, pipeline

# Build a document question-answering pipeline
p = pipeline('document-question-answering')
# Load the document to query
doc = document.load_document("/path/to/document.pdf")
# Ask each question against the same document
for q in ["What is the invoice number?", "What is the invoice total?"]:
    print(q, p(question=q, **doc.context))
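
The quickstart prints the raw pipeline output for each question. When the answers feed into other code, it can help to collect them into a dictionary instead. The sketch below shows one way to do that. The helper names extract_answer and ask_all are hypothetical, and the result shape (a dict with an "answer" key, or a list of such dicts ordered best-first) is an assumption based on common question-answering pipelines, so check what your pipeline actually returns. A stand-in pipeline with a placeholder value is used so the example runs without downloading a model.

```python
def extract_answer(result):
    """Pull the answer text out of a pipeline result.

    Assumption: the result is a dict with an "answer" key, or a list of
    such dicts ordered best-first. Verify this against your pipeline.
    """
    if isinstance(result, list):
        result = result[0] if result else {}
    return result.get("answer")


def ask_all(pipe, context, questions):
    """Ask every question against one document; map question -> answer."""
    return {q: extract_answer(pipe(question=q, **context)) for q in questions}


def fake_pipeline(question, **context):
    """Stand-in for a real pipeline, returning a canned placeholder answer."""
    canned = {"What is the invoice number?": "INV-0001"}
    return {"answer": canned.get(question), "score": 1.0}


answers = ask_all(fake_pipeline, {}, ["What is the invoice number?"])
print(answers)
```

With the objects from the quickstart, the equivalent call would be ask_all(p, doc.context, [...]), again assuming the result shape described above.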

Use cases :

DocQuery works across structured, semi-structured, and unstructured documents. You can ask questions about invoices, contracts, forms, emails, letters, receipts, and many other document types. You can also classify documents. As the underlying model evolves, more modeling options will be offered and the range of supported document types will expand.

GitHub Repo