
Scrape PDFs using Python

Hi guys, welcome to this blog post; I hope you are all doing well. In this post I will discuss how to scrape specific text data or tables from PDFs, and what kinds of problems you can face while scraping PDF data.

The data trapped inside PDFs is unstructured, and it can come from different sources, such as manually typed or system-generated documents. Depending on the source, we can classify PDFs into two categories:

  • Simple or readable PDFs.
  • Complex or scanned PDFs.
Simple or readable PDFs:

Simple PDFs are usually system generated or come from data-entry-related sources. Such PDFs are generally less complicated, and any kind of data can easily be extracted from them.

Complex or scanned PDFs:

Complex or scanned PDFs, on the other hand, usually come in scanned format, and they are very difficult to handle. Extracting data from them is hard because they are sometimes so complex that you can face severe data loss during extraction.


Methods to extract PDF data using Python:


There are several ways to extract data from PDFs, but only a few are really useful, and choosing the right one depends on the requirements you are working with.

Since I work in Python, here I will show how to extract PDF data using Python in different ways. The approaches mentioned here come from real project experience and are tested and work well.

There are several libraries available, but the best ones you can use for extraction are:
  • Tabula
  • Camelot
Here, one thing that we need to keep in mind is that both libraries offer several features, but we will be using only the features that are necessary to get our work done.
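
Just to show the contrast before we move on, here is a minimal Camelot sketch (assuming Camelot is installed; the file name example.pdf is only a placeholder). Each extracted table exposes a pandas DataFrame through its .df attribute:

import camelot

# read the tables on page 1 of a readable PDF (example.pdf is a placeholder path)
tables = camelot.read_pdf("example.pdf", pages="1")
print(tables[0].df)  # the first extracted table as a pandas DataFrame

The rest of this post focuses on Tabula.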

Tabula:

Tabula (installed as the tabula-py package) is a Python library that can be used to extract data from normal, readable, or semi-scanned PDFs, and it can extract only tabular data.

It has several features, like scraping data from all the pages of a PDF, scraping using an area or coordinate technique, etc.

To install tabula-py on your system, run:

pip install tabula-py

Since tabula-py calls a Java library (tabula-java) under the hood, you need a Java Runtime Environment installed to run the Python script. You can download one from the official site or follow this link:

https://docs.oracle.com/goldengate/1212/gg-winux/GDRAD/java.htm#BGBFHBEA
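
After installing it, you can quickly confirm that Java is available on your PATH by running the following command in a terminal:

java -version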


After installing the runtime environment, we can run our Python code in any IDE to get the tables from the PDF.

Code to read all the pages from a PDF file:

import tabula

get_tables = tabula.read_pdf("path/to/file.pdf", pages="all", encoding="ISO-8859-1")

After reading the PDF, you can access the table from each page using array indexing, for example get_tables[0], get_tables[1], etc.

One thing you have to keep in mind is that whatever tables you extract using either Tabula or Camelot, the data will come uncleaned and with unsuitable data types, so you have to use the pandas library to clean the tables and convert them into a suitable format for your requirement, as sketched below.
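
As a rough sketch of that clean-up step (reusing get_tables from the code above; the exact operations will depend on your tables):

df = get_tables[0]                                      # first table extracted by Tabula
df = df.dropna(how="all")                               # drop rows the PDF layout left empty
df.columns = [str(col).strip() for col in df.columns]   # tidy up the header text
# strip stray whitespace from text cells, leaving non-text columns untouched
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)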

Code to read data from specific pages, one by one:

get_tables = tabula.read_pdf("path/to/file.pdf", pages="1,2,3", encoding="ISO-8859-1")

(The pages argument accepts a comma-separated string of page numbers or a list of integers such as pages=[1, 2, 3].)


Code to read data from a specific part of a page using coordinates:

For this technique we need coordinates to extract data from a specific part of the PDF using tabula.

First we have to generate those coordinates, and for that we need to download the Tabula desktop application from this link:

https://tabula.technology/

After downloading Tabula, go inside the folder and run the tabula.exe file, which will start a local server on your computer and open a new window in your default browser. There:
  • Import your PDF file and click on the Extract button; it will take you to a new page.
  • Using the cursor, select the area of the PDF where your table is located.
  • Click the Preview button, then in the bar above change the export option from CSV to Script and click the Export button; it will download a .sh file to your computer.
  • Open the .sh file in an editor, copy the coordinates, and paste them somewhere separately.

Example:

java -jar tabula-java.jar  -a 164.093,47.813,782.978,564.953 -p 4 "$1"

Here the values after the -a flag are the coordinates (top, left, bottom, right) that we need to get our data. We will then use these coordinates in our Python code.

table = tabula.read_pdf("path/to/file.pdf", pages="4",  # the page your table is on
lattice=True, area=(158.13, 387.45, 340.83, 756.63), encoding="ISO-8859-1")

Here you will also have to use pandas and other Python libraries like NumPy and datetime for data cleaning and related work.
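
For instance, assuming hypothetical "Amount" and "Date" columns in the table extracted above, the type-conversion step could look like this:

import pandas as pd

df = table[0]                                                  # first table from the area-based read
df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")    # text -> numbers, bad cells become NaN
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")       # text -> proper datetime values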
