Python client library samples for Azure Form Recognizer

Model Overview

The Azure Form Recognizer prebuilt models combine OCR with deep learning models to identify and extract predefined text and data fields. Output is returned in JSON format. Form Recognizer v2.1 supports invoice, receipt, ID document, and business card models.

Model and description

  • Document Analysis :

    Extract printed and handwritten text lines, words, and their locations, and detect languages. You can also extract text, tables, and key-value pairs from structured documents, as well as named entities.

  • Pre-built :

    1. W-2 : Extract employee, employer, and wage information.

    2. Invoice : Extract key information from English and Spanish invoices.

    3. Receipts : Extract key information from English receipts.

    4. ID document : Extract key information from US driver's licenses and international passports.

    5. Business card : Extract key information from English business cards.

  • Custom :

    Extract data from forms and documents specific to your business. These models are trained on your own data for your specific use cases. A composed model is also available: it combines a collection of custom models into a single model that covers your form types.
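As a rough illustration of how the prebuilt models above map onto the `FormRecognizerClient` used later in this README, the sketch below selects a client method by model name. `PREBUILT_METHODS` and `recognize_prebuilt` are hypothetical names introduced here, not part of the SDK. W-2 is left out on the assumption that it is served through the newer document analysis API rather than `FormRecognizerClient`.

```python
# Hypothetical lookup table: prebuilt model name -> FormRecognizerClient method name.
PREBUILT_METHODS = {
    "invoice": "begin_recognize_invoices_from_url",
    "receipt": "begin_recognize_receipts_from_url",
    "id_document": "begin_recognize_identity_documents_from_url",
    "business_card": "begin_recognize_business_cards_from_url",
}


def recognize_prebuilt(client, model, document_url):
    """Run the prebuilt model named `model` on the document at `document_url`."""
    try:
        method_name = PREBUILT_METHODS[model]
    except KeyError:
        raise ValueError(f"Unsupported prebuilt model: {model!r}") from None
    # Each begin_* call returns a poller; result() blocks until analysis finishes.
    poller = getattr(client, method_name)(document_url)
    return poller.result()
```

With a real client, `recognize_prebuilt(form_recognizer_client, "receipt", blob_url)` would return the list of recognized forms for that document.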

Install Packages

pip install azure-ai-formrecognizer
pip install azure-identity
pip install azure-core
pip install azure-storage-blob

Authenticate the client

To interact with the Form Recognizer service, create an instance of a client. The client object requires an endpoint and a credential.
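One way to supply the endpoint and credential without hard-coding them is to read both from environment variables. The sketch below assumes the variable names `FORM_RECOGNIZER_ENDPOINT` and `FORM_RECOGNIZER_KEY`; these names and the two helper functions are illustrative choices, not something the SDK requires.

```python
import os


def load_form_recognizer_settings(env=os.environ):
    """Return (endpoint, key) read from the environment mapping `env`."""
    # FORM_RECOGNIZER_ENDPOINT / FORM_RECOGNIZER_KEY are assumed variable names.
    endpoint = env.get("FORM_RECOGNIZER_ENDPOINT", "").strip()
    key = env.get("FORM_RECOGNIZER_KEY", "").strip()
    if not endpoint or not key:
        raise RuntimeError("Set FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY first")
    return endpoint, key


def create_client(env=os.environ):
    """Build a FormRecognizerClient from the settings above."""
    # Imported lazily so the settings helper works without the Azure SDK installed.
    from azure.ai.formrecognizer import FormRecognizerClient
    from azure.core.credentials import AzureKeyCredential

    endpoint, key = load_form_recognizer_settings(env)
    return FormRecognizerClient(endpoint, AzureKeyCredential(key))
```

Keeping the key out of the source file makes it less likely to end up in version control.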

QuickStart

Connect to Azure Storage Container

from azure.storage.blob import ContainerClient

container_url = "https://xxxxxxxxx/testinvoice"
container = ContainerClient.from_container_url(container_url)

for blob in container.list_blobs():
  blob_url = container_url + "/" + blob.name
  print(blob_url)
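Plain string concatenation works for the simple container URL above, but it would produce an invalid URL if the container URL carries a query string (such as a SAS token) or if a blob name contains spaces. A minimal sketch of a more defensive builder, using only the standard library, follows; `build_blob_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_blob_url(container_url, blob_name):
    """Return the URL of `blob_name` inside the container at `container_url`."""
    parts = urlsplit(container_url)
    # Percent-encode the blob name but keep "/" so virtual folders survive.
    path = parts.path.rstrip("/") + "/" + quote(blob_name, safe="/")
    # Re-attach the original query string (for example a SAS token) after the path.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

In the loop above, `blob_url = build_blob_url(container_url, blob.name)` could then replace the concatenation.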

Enable Cognitive Services

from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

# Replace the placeholders with your resource's endpoint and API key
endpoint = "https://xxxxxxxx/"
credential = AzureKeyCredential("xxxxxxxxxxxxxxxx")
form_recognizer_client = FormRecognizerClient(endpoint, credential)
print(form_recognizer_client)

Create dataframe of invoice data

import pandas as pd

field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
rows = []  # one dict per recognized invoice

for blob in container.list_blobs():
  blob_url = container_url + "/" + blob.name

  print("Scanning " + blob.name + "...")
  poller = form_recognizer_client.begin_recognize_invoices_from_url(blob_url)
  # poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
  invoices = poller.result()  # blocks until the service has finished the analysis

  for invoice in invoices:
    row = {"FileName": blob.name}

    for field in field_list:
      entry = invoice.fields.get(field)

      # Fields the service did not find are left empty
      if entry:
        row[field] = entry.value

    rows.append(row)

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns=field_list + ["FileName"])
df.to_csv('invoice_data.csv')
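The loop above keeps every recognized value regardless of how sure the service was. Each recognized field also carries a `confidence` score, so a variation is to drop values below a threshold. The sketch below shows the idea with stand-in objects; `fields_to_row`, the `min_confidence` parameter, and the sample values are all made up for illustration.

```python
from types import SimpleNamespace


def fields_to_row(fields, field_list, min_confidence=0.0):
    """Map recognized fields to a flat dict, skipping low-confidence values."""
    row = {}
    for name in field_list:
        entry = fields.get(name)
        # Missing fields and fields below the threshold are stored as None.
        if entry is not None and (entry.confidence or 0.0) >= min_confidence:
            row[name] = entry.value
        else:
            row[name] = None
    return row


# Stand-in objects shaped like recognized fields (only .value and .confidence are used).
sample_fields = {
    "InvoiceId": SimpleNamespace(value="INV-001", confidence=0.97),
    "VendorName": SimpleNamespace(value="Contoso", confidence=0.41),
}
row = fields_to_row(sample_fields, ["InvoiceId", "VendorName", "DueDate"], min_confidence=0.5)
print(row)  # {'InvoiceId': 'INV-001', 'VendorName': None, 'DueDate': None}
```

A suitable threshold depends on the documents being processed, so it is worth checking a few results by hand before picking one.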

GitHub Repo