
Scrape PDFs using Python

Hi guys, welcome to this blog post; I hope you are doing well. In this post I will discuss how to scrape specific text data or tables from PDFs, and what kinds of problems you can face while scraping PDF data.

The data trapped inside a PDF is unstructured. It can come from different sources, such as manual data entry or system generation, and depending on the source we can classify PDFs into two categories:

  • Simple or readable PDFs.
  • Complex or scanned PDFs.
Simple or readable PDFs:

Simple PDFs are usually system generated or come from data-entry-related sources. They are generally less complicated, and any kind of data can be extracted from them easily.

Complex or scanned PDFs:

Complex or scanned PDFs, on the other hand, usually come from scanned documents. They are much harder to handle: their structure can be so complex that extracting data from them risks severe data loss.


Methods to extract PDF data using Python:


There are several ways to extract data from PDFs, but only a few are really useful, and choosing the right one depends on your requirements.

Since I work in Python, I will show how to extract PDF data using different Python approaches. The methods mentioned here come from real project experience and have been tested in practice.

There are several libraries available, but two of the best for extraction are:
  • Tabula
  • Camelot
One thing to keep in mind is that both libraries offer many features, but we will use only the ones that are necessary to get our work done.
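Since Camelot does not get its own walkthrough below, here is a minimal hedged sketch of how it is typically used (it assumes the camelot-py[cv] package is installed and that the PDF is text-based, not scanned; the file name is hypothetical):

```python
# Hedged sketch: reading tables with Camelot.
def extract_tables(path, pages="1"):
    """Return each table Camelot detects as a pandas DataFrame."""
    import camelot  # imported lazily so the sketch can be read without Camelot installed
    tables = camelot.read_pdf(path, pages=pages, flavor="lattice")
    return [t.df for t in tables]

# Usage (hypothetical file name):
# dfs = extract_tables("report.pdf", pages="all")
```

The "lattice" flavor works on tables with ruled lines; Camelot also offers a "stream" flavor for whitespace-separated tables.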

Tabula:

Tabula (the tabula-py package) is a Python library that can be used to extract data from normal, readable PDFs and, to a limited extent, semi-scanned ones. Note that it extracts only tabular data from PDFs.

It has several features, such as scraping data from all the pages of a PDF, or scraping a specific region using the area/coordinate technique.

To install tabula-py on your system, run:

pip install tabula-py

Since tabula-py runs the Java-based tabula engine under the hood, you need a Java runtime environment installed. You can download one from the official site or follow this link:

https://docs.oracle.com/goldengate/1212/gg-winux/GDRAD/java.htm#BGBFHBEA


After installing the runtime environment, we can run our Python code in any IDE to extract the tables from the PDF.

Code to read tables from all the pages of a PDF file:

import tabula

get_tables = tabula.read_pdf("Path", pages="all", encoding="ISO-8859-1")

After reading the PDF, you can access the tables from each page using list indexing, for example get_tables[0], get_tables[1], etc.

One thing to keep in mind: whatever tables you extract, using either Tabula or Camelot, will come back uncleaned and with unsuitable data types, so you will have to use the pandas library to clean them and convert them into a format suitable for your requirement.
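As a hedged sketch of that cleaning step (the column names and messy values below are made up for illustration; real extracted tables will differ), a typical pandas cleanup might look like this:

```python
import pandas as pd

# Hypothetical messy table, as Tabula or Camelot might return it:
# numbers stored as strings with thousands separators, stray whitespace,
# and a fully empty row.
raw = pd.DataFrame({
    "Item ": [" Widget", "Gadget ", None],
    "Amount": ["1,200", " 350", None],
})

clean = (
    raw.dropna(how="all")                    # drop fully empty rows
       .rename(columns=lambda c: c.strip())  # tidy header whitespace
)
clean["Item"] = clean["Item"].str.strip()    # tidy cell whitespace
clean["Amount"] = (
    clean["Amount"].str.replace(",", "", regex=False).str.strip().astype(int)
)
print(clean)
```

The same pattern (drop empties, strip whitespace, cast types) covers most of the mess these extractors leave behind.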

Code to read data from specific pages, one by one:

get_tables = tabula.read_pdf("Path", pages="1, 2, 3, 4, ..", encoding="ISO-8859-1")


Code to read data from a specific part of a page using coordinates:

For this technique we need coordinates to extract data from a specific part of the PDF using tabula.

First we need to generate these coordinates, and for that we need the Tabula desktop application, which you can download from this link:

https://tabula.technology/

After downloading Tabula, go into its folder and run the tabula.exe file. This starts a local server on your computer and opens a new window in your default browser. From there:

  • Import your PDF file and click the Extract button; it will take you to a new page.
  • Using the cursor, select the area of the PDF where the table is located.
  • Click the Preview button, and in the bar above, change the export option from CSV to Script, then click the Export button. This downloads a .sh file to your computer.
  • Open the .sh file in an editor, copy the coordinates, and paste them somewhere for later use.

Example:

java -jar tabula-java.jar  -a 164.093,47.813,782.978,564.953 -p 4 "$1"

Here the four numbers after the -a flag are the coordinates (top, left, bottom, right) that we need to get our data. We will then use these coordinates in our Python code.
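Instead of copying the numbers by hand, a short stdlib sketch can pull them out of the exported script line (the line below is the example shown above):

```python
import re

# Line copied from the .sh script that Tabula exported.
# The -a values are top,left,bottom,right page coordinates.
line = 'java -jar tabula-java.jar  -a 164.093,47.813,782.978,564.953 -p 4 "$1"'

# Grab the comma-separated numbers that follow the -a flag.
match = re.search(r"-a\s+([\d.,]+)", line)
area = tuple(float(v) for v in match.group(1).split(","))
print(area)  # (164.093, 47.813, 782.978, 564.953)
```

The resulting tuple can be passed straight to tabula-py's area parameter, which expects the same top, left, bottom, right order.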

table = tabula.read_pdf("Path", pages="page number",
lattice=True, area=(158.13, 387.45, 340.83, 756.63), encoding="ISO-8859-1")

Here you will also have to use pandas and other Python libraries like NumPy and datetime for data cleaning and related tasks.

