UiPath: How can I capture specific data from unstructured scanned PDF files (invoices) and export it to an Excel sheet?

We have business invoices in the form of scanned PDFs. The PDFs come from different vendors, so their layouts differ from one another. We want to export data such as Invoice Number, Invoice Date and item details in tabular form.
I am using the UiPath RPA tool for this problem.
Thanks,

Ashish Soni
Mar 20 in RPA by Ashish

edited Mar 20 by Ashish 164 views
So your scanned PDFs are in image format. Is that right?
The files are in .pdf format (File Properties --> Tagged --> OFF).
Sir, I am on the Blue Prism platform, so it would be helpful if you could share a solution for this problem in Blue Prism.

Hey @Subhiksha, if you want to extract data from a PDF file to Excel:

  • map the PDF to open through a browser,

  • then select the PDF file you need to extract data from,

  • you can read the data from the PDF after you spy the elements in HTML or Region mode.

1 answer to this question.


You could do the following:

  1. Install the UiPath.PDF.Activities package.

  2. Once the package is installed, the PDF activities will appear in the Activities pane.

  3. Use the Read PDF Text or Read PDF With OCR activity for your requirement. Scanned (image-only) PDFs need the OCR variant, because they contain no text layer.

  4. Write the result to Excel using the Write Range or Write Cell activity.

For more info refer to https://www.edureka.co/blog/uipath-pdf-data-extraction/
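
For anyone prototyping the same read-then-write flow outside UiPath, here is a minimal Python sketch of the last step only. It is not UiPath code: the `rows_to_csv` helper and the column names are assumptions made for illustration, and it produces CSV text (which Excel can open) rather than a native .xlsx file.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms.
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical record, as it might look after extraction.
    record = {"invoice_number": "INV-1001", "invoice_date": "20/03/2019"}
    print(rows_to_csv([record], ["invoice_number", "invoice_date"]))
```

Saving the returned text to a .csv file gives one row per invoice, which plays roughly the role Write Range plays in the steps above.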

answered Mar 20 by Sirajul
Thanks for the answer. These activities fetch the whole document's text, but my requirement is to fetch specific data whose position in the PDF is not fixed. For example, in one PDF the invoice number is displayed at the top right, and in another it appears in the middle of the document. I guess my problem is a candidate for IntelligentOCR, but I am not sure.
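
One possible way around the moving position: the label ("Invoice No", "Invoice Date") usually stays next to its value wherever it sits on the page, so the full OCR text can be searched by label instead of by coordinates. The Python sketch below illustrates that idea only. The patterns, field names and the `extract_fields` helper are assumptions for illustration, not anything provided by UiPath, and real vendor layouts will need their own label variants.

```python
import re

# Hypothetical label patterns; extend them per vendor as needed.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
        re.IGNORECASE,
    ),
    "invoice_date": re.compile(
        r"invoice\s*date\s*[:\-]?\s*(\d{1,2}[/\-.]\d{1,2}[/\-.]\d{2,4})",
        re.IGNORECASE,
    ),
}


def extract_fields(ocr_text):
    """Return the first match for each labelled field, or None if the label is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    # Made-up OCR output from two differently laid-out invoices.
    top_right = "ACME Corp\nInvoice No: INV-1001   Invoice Date: 20/03/2019\nTotal 1000"
    mid_page = "Bill To: X Ltd\nINVOICE NUMBER 77821\nInvoice Date - 03-20-2019"
    print(extract_fields(top_right))
    print(extract_fields(mid_page))
```

A field that comes back as None signals a layout the patterns do not cover yet, which is a useful place to flag the invoice for manual review.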

You could try the Get OCR Text activity; it should let you pick out a specific field value. Have a look at this for more details: https://docs.uipath.com/studio/docs/example-of-using-ocr-and-image-automation

I used Get OCR Text and it successfully fetches a specific value, but how can we fetch a table with multiple rows and columns from a PDF? This activity only lets you select a specific piece of text. Suppose one PDF lists only one product and another lists two or more. In that case the script will fail.

PDF #1 has the following table:

Item Number   Item Quantity   Description   Price
#001          500             Pencil        1000

PDF #2 has the following table:

Item Number   Item Quantity   Description   Price
#004          1               Dell Laptop   50000
#005          1               HP Laptop     40000
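
A rough Python sketch of one way to cope with a varying number of rows: take the whole OCR text and keep every line that looks like an item row, instead of selecting one fixed region. The row pattern is an assumption based only on the two sample tables above (a row starts with `#` plus digits, then a quantity, a free-text description, and a numeric price at the end), and `parse_item_rows` is a hypothetical helper, not a UiPath activity.

```python
import re

# Assumed row shape: "#<item> <quantity> <description...> <price>"
ROW_PATTERN = re.compile(r"^(#\d+)\s+(\d+)\s+(.+?)\s+(\d+(?:\.\d+)?)$")


def parse_item_rows(ocr_text):
    """Collect every line that matches the assumed item-row shape."""
    rows = []
    for line in ocr_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if not match:
            continue  # header lines and unrelated text are skipped
        item_number, quantity, description, price = match.groups()
        rows.append(
            {
                "item_number": item_number,
                "quantity": int(quantity),
                "description": description,
                "price": float(price),
            }
        )
    return rows


if __name__ == "__main__":
    sample = (
        "Item Number Item Quantity Description Price\n"
        "#004 1 Dell Laptop 50000\n"
        "#005 1 HP Laptop 40000\n"
    )
    for row in parse_item_rows(sample):
        print(row)
```

Because the loop collects whatever rows it finds, the same code handles a one-product invoice and a ten-product invoice. OCR noise (a misread `#` or a split line) would still need handling that this sketch does not attempt.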

In that case, first read all the PDFs in a folder using Directory.GetFiles(folderPath, "*.pdf", SearchOption.AllDirectories), then loop over the returned file paths and process each invoice in turn.
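
For comparison, a rough Python counterpart of that recursive folder scan, in case someone wants to test the extraction logic outside UiPath. The `find_pdfs` name is hypothetical, and the extension check is made case-insensitive on purpose so that files named `.PDF` are picked up on any operating system.

```python
from pathlib import Path


def find_pdfs(folder_path):
    """Return all .pdf files under folder_path, including subfolders, sorted by path."""
    root = Path(folder_path)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # "." is just a placeholder folder for the demo.
    for pdf_path in find_pdfs("."):
        print(pdf_path)
```

Each returned path would then go through the OCR and extraction steps one file at a time.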

You could look into https://forum.uipath.com/t/how-to-read-multiple-fields-from-a-scanned-pdf/66271 for more information.