Model Overview
Azure Form Recognizer prebuilt models combine OCR with deep learning models to identify and extract predefined text and data fields. Output is returned in JSON format. Form Recognizer v2.1 supports invoice, receipt, ID document, and business card models.
Model and description
Document Analysis :
Extract printed and handwritten text lines, words, and their locations, and detect languages. You can also extract tables and key-value pairs from a document's structure, as well as named entities.
Pre-built :
W-2 : Extract employee, employer, wage information etc.
Invoice : Extract key information from English and Spanish invoices.
Receipts : Extract key information from English receipts.
ID document : Extract key information from US driver's licenses and international passports.
Business card : Extract key information from English business cards.
Custom :
Extract data from forms and documents specific to your business. These models are trained on your own data for your use cases. There is also a composed model, which groups a collection of custom models under a single model built from your form types.
Install Packages
pip install azure-ai-formrecognizer
pip install azure-identity
pip install azure-core
pip install azure-storage-blob
Authenticate the client
To interact with the Form Recognizer service, create an instance of a client. The client object requires an endpoint and a credential.
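As a minimal sketch of supplying those two values without hard-coding them, the helper below reads them from environment variables. The variable names FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY and the helper load_form_recognizer_settings are hypothetical, chosen only for illustration; they are not part of the Azure SDK.

```python
import os


def load_form_recognizer_settings(env=None):
    """Return (endpoint, key) read from environment variables.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    env = os.environ if env is None else env
    endpoint = env.get("FORM_RECOGNIZER_ENDPOINT", "").strip()
    key = env.get("FORM_RECOGNIZER_KEY", "").strip()
    if not endpoint or not key:
        raise ValueError("Set FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY")
    if not endpoint.startswith("https://"):
        raise ValueError("The endpoint is expected to be an https:// URL")
    return endpoint, key
```

The returned endpoint can be passed to FormRecognizerClient and the key to AzureKeyCredential, in place of the literal strings used in the QuickStart.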
QuickStart
Connect to Azure Storage Container
from azure.storage.blob import ContainerClient

# URL of the container that holds the invoices (placeholder value)
container_url = "https://xxxxxxxxx/testinvoice"
container = ContainerClient.from_container_url(container_url)

# Print the URL of every blob in the container
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)
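Plain string concatenation works for a bare container URL. If the container URL carries a query string (for example a SAS token, which is an assumption here and not something the original setup states), or if blob names contain spaces, the name has to go into the path before the query and be percent-encoded. A minimal sketch using only the standard library, with the hypothetical helper name build_blob_url:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_blob_url(container_url, blob_name):
    """Join a container URL and a blob name into a blob URL.

    Any query string on the container URL is kept at the end, and the
    blob name is percent-encoded except for "/".
    """
    parts = urlsplit(container_url)
    path = parts.path.rstrip("/") + "/" + quote(blob_name, safe="/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Inside the loop, build_blob_url(container_url, blob.name) could then stand in for the concatenation.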
Enable Cognitive Services
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

# Endpoint and key of the Form Recognizer resource (placeholder values)
endpoint = "https://xxxxxxxx/"
credential = AzureKeyCredential("xxxxxxxxxxxxxxxx")
form_recognizer_client = FormRecognizerClient(endpoint, credential)
print(form_recognizer_client)
Create dataframe of invoice data
import pandas as pd

field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
rows = []

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print("Scanning " + blob.name + "...")
    # Start the long-running invoice analysis and wait for it to finish
    poller = form_recognizer_client.begin_recognize_invoices_from_url(blob_url)
    # poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    for invoice in invoices:
        # One row per recognized invoice; fields that were not found stay None
        row = {field: None for field in field_list}
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                row[field] = entry.value
        row["FileName"] = blob.name
        rows.append(row)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows, columns=field_list + ["FileName"])
df.to_csv('invoice_data.csv')
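The loop above copies every recognized field into the table regardless of how confident the service was in it. Each recognized field also carries a confidence score, so one option is to drop low-confidence values before writing the CSV. The following is a sketch under stated assumptions: fields_to_row and the min_confidence threshold are hypothetical names, and the sample objects are stand-ins that only mimic the value and confidence attributes, not real service output.

```python
from types import SimpleNamespace


def fields_to_row(fields, field_list, min_confidence=0.0):
    """Build one row dict from a mapping of field name -> recognized field.

    Each field object is expected to expose .value and .confidence.
    Missing fields, and fields below min_confidence, are stored as None.
    A field without a confidence score is treated as confidence 0.0.
    """
    row = {}
    for name in field_list:
        entry = fields.get(name)
        confidence = getattr(entry, "confidence", None) or 0.0
        if entry is None or confidence < min_confidence:
            row[name] = None
        else:
            row[name] = entry.value
    return row


# Stand-in objects that mimic the attributes used above (not real service output)
sample_fields = {
    "InvoiceId": SimpleNamespace(value="INV-001", confidence=0.97),
    "VendorName": SimpleNamespace(value="Contoso", confidence=0.42),
}
row = fields_to_row(sample_fields, ["InvoiceId", "VendorName", "DueDate"], min_confidence=0.5)
print(row)
```

With the default min_confidence of 0.0 every field that is present is kept, which matches the behavior of the loop above.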