Please describe the best method to extract specific text from a PDF and input that text into a C application text field

I would like to know the best method of extracting specific data (text) from a PDF and entering that data into a text field of a tax software application.

A user will fill in the PDF form in different sections as "answers", e.g. 1/ Question... [Text Answer]

I will need to copy or extract answer 1 and input answer 1 into the correct text field of the Tax software at 1/ Answer 1.

Any advice is appreciated.
Sep 14, 2020 in RPA by Chris
• 130 points
That all works great. I have no issues getting the text, in bulk or from specific areas of a PDF, into a variable for display in a "Message Box", so step 1 is complete.

Step 2/ Entering the text into a tax software application. Are you able to provide a link or reference for me to have a look at, please?

I think I will need to loop the two actions (extracting text from the PDF and entering it into the tax software) until all the answers are filled in.
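The loop described above could look roughly like this. It is only a minimal Python sketch of the idea, not a UiPath workflow: `fill_fields`, `fake_type_into` and the sample values are hypothetical, and the callback stands in for whatever UI step actually types into the tax software.

```python
from typing import Callable, Dict, List


def fill_fields(answers: Dict[str, object],
                type_into: Callable[[str, str], None]) -> List[str]:
    """Send every captured answer to its matching field, one at a time."""
    filled = []
    for field_name, value in answers.items():
        # Numbers (e.g. Int32 totals) are converted to text before typing.
        type_into(field_name, str(value))
        filled.append(field_name)
    return filled


# Stand-in for the real UI step: it only records what would have been typed.
typed: Dict[str, str] = {}


def fake_type_into(field_name: str, text: str) -> None:
    typed[field_name] = text


# Hypothetical sample values; the real ones come from the PDF.
answers = {"D1Code": 12, "D1Total": 100, "D2Comments": "none"}
filled = fill_fields(answers, fake_type_into)
```

Keeping the answers in one dictionary means the extraction step and the typing step share a single list of fields, so adding a question to the PDF only needs one new entry.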

If the PDF is in a structured format, use PDF integration and object cloning.
I'm not sure how that would work.

If I have the PDF data (text) I need in variables of type String, including the Int32 values I convert to String, is it OK to use the "Type Into" activity to select the area in the business application and send the string variable captured from the PDF to it?

I have captured this from the PDF

Used this to display in Message Box just for testing:

"D1Code : " + D1Code.ToString + Environment.NewLine +
"D1Total: " + D1Total.ToString + Environment.NewLine +
"D2Total :" + D2Total.ToString + Environment.NewLine +
"D2Comments :" + D2Comments.ToString + Environment.NewLine +
"D3Total :" + D3Total.ToString + Environment.NewLine

Are there different options?
Hey @Chris did that approach work?


It worked in "Notepad" as a test. I went backwards though. When trying to grab the specific text in the PDF, I am now getting different errors like "Legacy Chrome ..." (even if I am using Edge),

and this one when I try to open the PDF and read it using Edge.

So I am not sure how to fix this one. Is it a dependency? What is the best option for opening the PDF: should I use Chrome, Edge or something else? Adobe Reader didn't seem to work either.


@Chris, could you please post the complete error that you have encountered?

Steps broken down.

Here is a snip of the PDF. The red boxes indicate the areas where users can enter text, which are the areas I wish to get. I made the PDF, so I can change it to help the situation if needed.

I have stripped this to the bare bones. Process-Sequence



Variables set as strings or Int32 and converted via ToString in MessageBox

I have used Edge, Chrome, AVG Browser, and Adobe. I can use whatever is recommended.

Result: the Message Box is displayed at the end of the run, but it is either empty or contains the words "Chrome Legacy Window". If I grab the full text it works and includes the text entered by the users, no problems. If I use OCR screen scraping with Tesseract only, it seems to work.

I am open to suggestions for the best way to do this, as I have the flexibility to follow best practice for best results.

Keeping in mind I need to put the results or variables into another application after this.
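Since grabbing the full text works reliably, one option is to pull the individual answers out of that full text instead of scraping each red box. The sketch below shows the idea in Python under a loud assumption: that each answer appears in the extracted text on its own line as "label : value". The real layout of this PDF is not known, so `extract_answers` and `sample_text` are hypothetical and the pattern would need adjusting to the actual output.

```python
import re
from typing import Dict, List


def extract_answers(full_text: str, labels: List[str]) -> Dict[str, str]:
    """Return the text that follows each 'label :' on its own line."""
    answers = {}
    for label in labels:
        # Assumed layout: "<label> : <value>" on one line of the full text.
        # Only spaces and tabs are skipped, so an empty answer stays empty
        # instead of swallowing the next line.
        pattern = rf"^{re.escape(label)}[ \t]*:[ \t]*(.*)$"
        match = re.search(pattern, full_text, flags=re.MULTILINE)
        answers[label] = match.group(1).strip() if match else ""
    return answers


# Hypothetical full-text output; the real text comes from the PDF read step.
sample_text = "D1Code : A12\nD1Total: 150\nD2Comments : Paid in full\n"
answers = extract_answers(sample_text,
                          ["D1Code", "D1Total", "D2Comments", "D3Total"])
```

A label that is missing from the text comes back as an empty string, which makes it easy to spot unanswered questions before anything is typed into the other application.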

You are most probably getting this exception because you used a partial selector. Try using a complete selector instead.

Hope this helps!
The selector details didn't make any difference, but I added a hotkey stroke to set the PDF to "Actual Size" before reading the text. This has worked consistently since. I can display the text (captured in a variable) in a Message Box or text file.

Part 2/ I can open the business tax application, navigate, and enter the text in the correct field, but it seems to lose focus after the first input and the text won't save in the application. It is not consistent either, as it sometimes enters the text in the wrong spot.

I am using the "Type Into" activity. This seems to work, but again the UI element seems to lose focus. I have checked the selectors and they have been validated. I have tried clicking in the element first and tabbing through to the correct field, and I get inconsistent results.

Any ideas (there are no errors) to help make this solid so it works 100% of the time?
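One common way to harden this kind of step is to read the field back after typing and retry if the value did not stick. The Python sketch below only illustrates that verify-and-retry pattern; it is not UiPath code. `type_with_verify` and `FlakyField` are hypothetical names, and the two callbacks stand in for whatever the workflow uses to set and read the field.

```python
from typing import Callable


def type_with_verify(set_text: Callable[[str], None],
                     get_text: Callable[[], str],
                     value: str,
                     max_attempts: int = 3) -> bool:
    """Type a value, read the field back, and retry until it matches."""
    for _ in range(max_attempts):
        set_text(value)
        if get_text() == value:
            return True   # the field now holds the expected text
    return False          # give up so the caller can log or raise


class FlakyField:
    """Stand-in for a UI field that loses the first write (lost focus)."""

    def __init__(self) -> None:
        self.text = ""
        self.writes = 0

    def set_text(self, value: str) -> None:
        self.writes += 1
        if self.writes > 1:   # the first attempt is dropped
            self.text = value

    def get_text(self) -> str:
        return self.text


field = FlakyField()
ok = type_with_verify(field.set_text, field.get_text, "150")
```

Returning a flag instead of silently continuing lets the process stop at the first field that cannot be filled, rather than typing later answers into the wrong place.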
