Convert PDF into excel data

0 votes
I have a PDF that contains the data below as an embedded image. How can I convert it into a tabular format such as CSV or Excel?
Oct 2, 2022 in Others by Kithuzzz
• 38,010 points
417 views

1 answer to this question.

0 votes

To convert data from an image in a PDF to a tabular format using OpenCV and Python, you typically go through these steps:

  1. Read the Image: Use OpenCV to read the image extracted from the PDF.
  2. Preprocess the Image: Apply various preprocessing techniques like converting to grayscale, thresholding, etc., to enhance the image for OCR.
  3. OCR (Optical Character Recognition): Use an OCR tool, like Tesseract OCR, to extract text from the preprocessed image.
  4. Data Parsing and Structuring: Parse the extracted text to structure it into a tabular format. This might require custom coding depending on the layout of the data in the image.
  5. Export to CSV/Excel: Finally, use a Python library like Pandas to export the structured data into a CSV or Excel file.

Here's a basic outline of how you could do this in Python:

import cv2
import pytesseract
import pandas as pd

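# Load the image (cv2.imread returns None instead of raising if the file cannot be read)
image = cv2.imread('path_to_your_image.jpg')
if image is None:
    raise FileNotFoundError('Could not read path_to_your_image.jpg')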

# Preprocess the image (example: convert to grayscale)
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# You can add more preprocessing steps like thresholding here

# Use Tesseract OCR to extract text
text = pytesseract.image_to_string(gray_image)

# Parse the text into a structured format (this part depends on your specific data)
# Example: keep the non-empty lines and split each one into whitespace-separated columns
lines = [line for line in text.splitlines() if line.strip()]
data = [line.split() for line in lines]

# Convert the structured data into a DataFrame
# Rows with fewer fields than the widest row are padded with missing values
df = pd.DataFrame(data)
df.columns = [f'Column{i + 1}' for i in range(df.shape[1])]  # Rename the columns as per your data

# Save the DataFrame to a CSV file
df.to_csv('output.csv', index=False)

Please replace 'path_to_your_image.jpg' with the path to your image file and adjust the column names and data parsing logic according to your specific data format.
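Note that line.split() breaks on every space, so a cell such as "New York" ends up in two columns. As a rough sketch of one alternative, assuming the OCR text keeps at least two spaces between columns (Tesseract can be asked to keep spacing with the preserve_interword_spaces=1 config variable, though results vary by image), you could split on runs of whitespace and pad short rows. The sample text and the function name parse_table are made up for illustration:

```python
import csv
import io
import re


def parse_table(text, min_gap=2):
    """Split OCR text into rows of cells, treating min_gap or more spaces as a column break."""
    gap = re.compile(r'\s{%d,}' % min_gap)
    rows = [gap.split(line.strip()) for line in text.splitlines() if line.strip()]
    width = max((len(row) for row in rows), default=0)
    # Pad short rows so every row has the same number of cells
    return [row + [''] * (width - len(row)) for row in rows]


# Hypothetical OCR output, not taken from the question's PDF
sample_text = """Name        City        Amount
Alice       New York    120.50

Bob         Paris
"""

rows = parse_table(sample_text)

# Write the rows as CSV (to a string here; pass an open file to save to disk)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting rows list can also be passed to pd.DataFrame(rows[1:], columns=rows[0]) if the first line is a header.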

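To run this script you need Python with OpenCV (opencv-python), Pytesseract (pytesseract), and Pandas (pandas) installed. Pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed and on your PATH. If you want an Excel file instead of a CSV, df.to_excel('output.xlsx', index=False) works as well, provided an Excel writer such as openpyxl is installed.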
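If you are not sure whether everything is in place, a small check along these lines may help before running the OCR script. It is only a sketch: the function name check_requirements is hypothetical, and it assumes the OCR executable is called tesseract on your system:

```python
import importlib.util
import shutil


def check_requirements(modules=('cv2', 'pytesseract', 'pandas'), binary='tesseract'):
    """Report which Python modules and which OCR executable can be found."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # The OCR program is looked up on PATH, separately from the Python packages
    status[binary + ' (executable)'] = shutil.which(binary) is not None
    return status


for name, found in check_requirements().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as MISSING needs to be installed first (the Python packages with pip, the Tesseract program with your operating system's installer or package manager).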

To fine-tune this process, you might need to experiment with different image preprocessing techniques and adjust the data parsing logic to match the layout of your data.
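When plain text splitting does not match the table layout, one option is to work from word positions instead. pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT) returns parallel lists such as 'text', 'left', 'top', 'width' and 'conf'. The sketch below groups such words into rows by their vertical position and into cells by the horizontal gap between them. The function name words_to_rows, the pixel thresholds, and the sample dictionary are assumptions for illustration and will need tuning for a real image:

```python
def words_to_rows(ocr, row_tol=10, col_gap=40, min_conf=0):
    """Group OCR word boxes into table rows, then into cells.

    ocr is a dict of parallel lists with the keys 'text', 'left', 'top',
    'width' and 'conf' (the layout of pytesseract's Output.DICT result).
    """
    words = []
    for text, left, top, width, conf in zip(
            ocr['text'], ocr['left'], ocr['top'], ocr['width'], ocr['conf']):
        # Skip empty entries and words below the confidence threshold
        if text.strip() and float(conf) >= min_conf:
            words.append((top, left, width, text.strip()))
    words.sort()  # top to bottom, then left to right

    # A word joins the current row if its top edge is within row_tol
    # pixels of the top edge of that row's first word
    lines = []
    for word in words:
        if lines and word[0] - lines[-1][0][0] <= row_tol:
            lines[-1].append(word)
        else:
            lines.append([word])

    rows = []
    for line in lines:
        line.sort(key=lambda w: w[1])  # left to right within the row
        cells = [line[0][3]]
        prev_right = line[0][1] + line[0][2]
        for _top, left, width, text in line[1:]:
            if left - prev_right > col_gap:
                cells.append(text)        # wide gap: start a new cell
            else:
                cells[-1] += ' ' + text   # narrow gap: same cell
            prev_right = left + width
        rows.append(cells)
    return rows


# Hypothetical word boxes, shaped like pytesseract's Output.DICT result
sample = {
    'text':  ['Name', 'City', 'Alice', 'New', 'York', '', 'Bob', 'Paris'],
    'left':  [10, 200, 10, 200, 245, 0, 10, 200],
    'top':   [20, 21, 60, 62, 61, 0, 100, 99],
    'width': [50, 40, 48, 38, 45, 0, 35, 50],
    'conf':  [96, 95, 93, 91, 90, -1, 94, 92],
}

table = words_to_rows(sample)
print(table)
```

Because cells are decided by pixel gaps rather than by spaces in the text, multi-word values such as "New York" stay together as long as col_gap is larger than a normal word gap and smaller than the gap between columns.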

To learn more, check out the OpenCV Tutorial with Python.

answered Oct 3, 2022 by narikkadan
• 63,420 points
