Scrape PDFs using Python

Hi guys, welcome to this blog post; I hope you are all doing well. In this post I will discuss how to scrape specific text data or tables from PDFs, and what kinds of problems you can face while scraping PDF data.

The data trapped inside a PDF is unstructured, and it can come from different sources, such as manual typing or system generation. Depending on the source, we can classify PDFs into two categories:

  • Simple or readable PDFs
  • Complex or scanned PDFs
Simple or readable PDFs:

Simple PDFs are usually system generated or come from data-entry sources. Such PDFs are less complicated, and any kind of data can be extracted from them fairly easily.

Complex or scanned PDFs:

Complex or scanned PDFs, on the other hand, generally come in a scanned (image-based) format. They are much harder to handle, and while extracting data from them you can face severe data loss because of their complex structure.


Methods to extract PDF data using Python:


There are several ways to extract data from PDFs, but only a few are really useful, and choosing the right one depends on your requirements.

Since I work in Python, I will show how to extract PDF data using Python in different ways. The approaches mentioned here come from real-world work experience and are tested and known to work well.

There are several libraries available, but the best ones you can use for extraction are:
  • Tabula
  • Camelot
One thing to keep in mind is that both libraries offer many features, but we will use only the features that are necessary to get our work done.
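
Although most of this post walks through Tabula, here is a minimal Camelot sketch for comparison (the file name is a placeholder, and the "lattice" flavor assumes the table has visible ruled lines):

import camelot

# Extract ruled tables from page 1; "lattice" expects visible cell borders
tables = camelot.read_pdf("path/to/file.pdf", pages="1", flavor="lattice")

print(tables.n)      # number of tables Camelot found
print(tables[0].df)  # each table exposes its data as a pandas DataFrame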

Tabula:

Tabula is a Python library that can be used to extract data from normal, readable, or semi-scanned PDFs, and it can extract only tabular data from them.

It has several features, like scraping data from all of a PDF's pages, scraping using an area or coordinate technique, etc.

To install tabula-py on your system, run:

pip install tabula-py

Since tabula-py uses a Java runtime environment under the hood, you need to download a Java runtime from the official site; you can follow this link:

https://docs.oracle.com/goldengate/1212/gg-winux/GDRAD/java.htm#BGBFHBEA
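
As a quick sanity check (just an illustrative sketch, assuming java is already on your PATH), you can confirm the runtime is visible from Python before calling tabula:

import subprocess

# java prints its version banner to stderr, so check both streams
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr or result.stdout)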


After installing the runtime environment, we can run our Python code in any IDE to get the tables from the PDF.

Code to read all pages from a PDF file:

import tabula

# Read every table from every page; returns a list of pandas DataFrames
get_tables = tabula.read_pdf("path/to/file.pdf", pages="all", encoding="ISO-8859-1")

After reading the PDF you can access the table from each page using list indexing, for example get_tables[0], get_tables[1], etc.
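
For instance, a quick loop (reusing get_tables from the snippet above) shows how many tables came out and how big each one is:

# Each element of the list is a pandas DataFrame
for i, table in enumerate(get_tables):
    print(f"Table {i}: {table.shape[0]} rows x {table.shape[1]} columns")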

One thing you have to keep in mind is that whatever tables you extract, with either Tabula or Camelot, the data will come out uncleaned and with unsuitable data types, so you have to use the pandas library to clean the tables and convert them into a format suitable for your requirements.
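
As a minimal cleaning sketch (the column names Item, Amount, and Date are hypothetical placeholders for whatever your table actually contains):

import pandas as pd

df = get_tables[0]

# Strip stray whitespace from the headers and from text cells
df.columns = df.columns.str.strip()
df["Item"] = df["Item"].str.strip()

# Coerce a numeric column; unparseable cells become NaN instead of raising
df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")

# Parse a date column into real datetime values
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")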

Code to read data from specific pages, one by one:

# Pass the specific pages you want, e.g. "1,2,3" or a list like [1, 2, 3]
get_tables = tabula.read_pdf("path/to/file.pdf", pages="1,2,3", encoding="ISO-8859-1")
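
If you want to keep the raw extractions around for later cleaning, one option (a sketch with hypothetical file names) is to dump each table to its own CSV:

# Write each extracted DataFrame to a separate CSV file
for i, table in enumerate(get_tables):
    table.to_csv(f"table_{i}.csv", index=False)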


Code to read data from a specific part of a page using coordinates:

For this technique we need coordinates in order to extract data from a specific part of the whole PDF using tabula.

We first have to generate these coordinates, and for that we need the Tabula desktop application, which you can download from this link:

https://tabula.technology/

After downloading Tabula, go inside the folder and run the tabula.exe file, which will start a local server on your computer and open a new window in your default browser. There:

  • Import your PDF file and click the Extract button; it will navigate you to a new page.
  • Using the cursor, select the area of the PDF where your table is located.
  • Click the Preview button, then in the bar above change the export option from CSV to Script and click Export; this will download a .sh file to your computer.
  • Open the .sh file in an editor, copy the coordinates, and paste them somewhere separately.

Example:

java -jar tabula-java.jar  -a 164.093,47.813,782.978,564.953 -p 4 "$1"

The numbers after the -a flag are the coordinates we need to get our data. We will now use these coordinates in our Python code.

# pages matches the -p flag from the export; area is copied from the .sh file
table = tabula.read_pdf("path/to/file.pdf", pages=4,
                        lattice=True, area=(158.13, 387.45, 340.83, 756.63),
                        encoding="ISO-8859-1")
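
Note that tabula reads the area values in the order (top, left, bottom, right), measured in PDF points; this matches the order the Tabula application exports after its -a flag.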

Here, too, you will have to use pandas and other Python libraries like NumPy and datetime for data cleaning and related work.
