Hi everyone, welcome to this blog post. I hope you are all doing well. In this post I will discuss how to scrape specific text or tables from PDFs, and what kinds of problems you can face while scraping PDF data.
The data trapped inside PDFs is unstructured, and it can come from different sources, such as manually typed or system-generated documents. Depending on the source, we classify PDFs into two categories:
- Simple or readable PDFs.
- Complex or scanned PDFs.
Simple or readable PDFs:
Simple PDFs are either system generated or come from data entry sources. They are generally less complicated, and data can be extracted from them easily.
Complex or scanned PDFs:
Complex or scanned PDFs, on the other hand, may also come from system-generated sources but are generally in a scanned format. They are much harder to handle: their structure can be so complex that you face severe data loss while extracting data from them.
There are several ways to extract data from PDFs, but only a few are really useful, and the right choice depends on your requirements.
Since I work in Python, I will describe different ways to extract data from PDFs using Python. The approaches mentioned here come from real work experience and have been tested in practice.
There are several libraries available, but the two I recommend for extraction are:
- Tabula
- Camelot
One thing to keep in mind is that both libraries offer many features, but we will use only the ones needed to get our work done.
Tabula:
Tabula (the tabula-py package) is a Python library that can extract data from normal or readable PDFs and from semi-scanned PDFs. It extracts only tabular data.
It has several features, such as scraping data from all pages of a PDF, or scraping a specific region using area coordinates.
To install Tabula on your system, run:
pip install tabula-py
Since Tabula uses a Java runtime environment under the hood, you need to install a Java runtime from the official site, or you can follow this link:
https://docs.oracle.com/goldengate/1212/gg-winux/GDRAD/java.htm#BGBFHBEA
After installing the runtime environment, we can run our Python code in any IDE to get the tables from the PDF.
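Before running any Tabula code, it can help to confirm that Java is actually visible to Python. The snippet below is only a minimal sketch: the helper name java_available is my own, and it only checks that a java executable is on the PATH, not which version it is.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    print("Java found, Tabula should be able to start.")
else:
    print("Java not found: install a Java runtime and add it to PATH.")
```

If this prints the second message even though Java is installed, the usual cause is that the Java bin folder has not been added to the PATH environment variable.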
Code to read all pages from a PDF file:
import tabula
# Replace "Path" with the path to your PDF file
get_tables = tabula.read_pdf("Path", pages="all", encoding="ISO-8859-1")
After reading the whole PDF, you get back a list of tables, and you can access each table using list indexing, for example get_tables[0], get_tables[1], etc.
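When a PDF has many pages it is easy to lose track of which index holds which table. Here is a small sketch of how I would list them. The function names read_all_tables and summarize and the file name in the usage comment are hypothetical, and the sketch assumes read_pdf returns a list of pandas DataFrames, which is its default behaviour in recent tabula-py versions.

```python
def read_all_tables(pdf_path: str):
    """Read every table on every page of the PDF."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    return tabula.read_pdf(pdf_path, pages="all", encoding="ISO-8859-1")


def summarize(tables):
    """Return (index, rows, columns) for each extracted table."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]


# Usage (hypothetical file name):
# get_tables = read_all_tables("report.pdf")
# for index, rows, cols in summarize(get_tables):
#     print(f"get_tables[{index}]: {rows} rows x {cols} columns")
```

A table with an unexpected number of rows or columns is usually the first sign that Tabula has split or merged something it should not have.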
Keep in mind that whatever tables you extract with either Tabula or Camelot, the data will come out uncleaned and often with unsuitable data types, so you have to use the pandas library to clean the tables and convert them into a format that suits your requirements.
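To give an idea of what that cleaning can look like, here is a rough sketch. Everything in it is an assumption for illustration: the helper names, the day/month/year date format, and the idea that numbers arrive as strings with thousands separators. Your own PDFs will need their own rules.

```python
from datetime import datetime


def to_number(cell):
    """Turn strings like ' 1,234.50 ' into floats; return None if not numeric."""
    if cell is None:
        return None
    text = str(cell).strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


def to_date(cell, fmt="%d/%m/%Y"):
    """Parse a date string in the assumed format; return None if it does not match."""
    try:
        return datetime.strptime(str(cell).strip(), fmt).date()
    except ValueError:
        return None


def clean_table(df, number_cols=(), date_cols=()):
    """Apply the cell cleaners to a pandas DataFrame returned by Tabula."""
    df = df.dropna(how="all").copy()  # drop rows that are completely empty
    df.columns = [str(c).strip() for c in df.columns]  # tidy header names
    for col in number_cols:
        df[col] = df[col].map(to_number)
    for col in date_cols:
        df[col] = df[col].map(to_date)
    return df
```

For example, clean_table(get_tables[0], number_cols=["Amount"], date_cols=["Date"]) would convert those two columns, where "Amount" and "Date" stand in for whatever headers your table really has.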
Code to read data from one page or from specific pages:
# List the page numbers you need, separated by commas
get_tables = tabula.read_pdf("Path", pages="1,2,3,4", encoding="ISO-8859-1")
Code to read data from a specific part of a page using coordinates:
To extract data from a specific part of a page, Tabula needs the coordinates of that area. We have to generate these coordinates first, and for that we need the Tabula desktop application, which you can download from this link:
https://tabula.technology/
After downloading Tabula into a folder, go inside the folder and run the tabula.exe file. This starts a local server on your computer and opens a new window in your default browser. There, import your PDF file and click the Extract button. It takes you to a new page where you can use the cursor to select the area of the PDF where the table is located. Then click the Preview button, change the export format in the top bar from CSV to Script, and click Export. This downloads a .sh file to your computer. Open the file in any text editor or IDE, copy the coordinates, and save them somewhere.
Example:
java -jar tabula-java.jar -a 164.093,47.813,782.978,564.953 -p 4 "$1"
Here the four comma-separated numbers after -a are the coordinates of the selected area (top, left, bottom, right), and the number after -p is the page. We will use these coordinates in our Python code:
# Replace "Path" and "page number" with your own values
table = tabula.read_pdf("Path", pages="page number",
                        lattice=True, area=(158.13, 387.45, 340.83, 756.63), encoding="ISO-8859-1")
Here too you have to use pandas and other Python libraries, such as NumPy and datetime, to clean the data.
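If you export scripts for many PDFs, copying the coordinates by hand gets tedious. As a possible shortcut, the sketch below pulls the area and page out of a command line like the one in the exported .sh file. The function names are hypothetical, and the regular expression assumes the exported line has exactly the -a and -p form shown in the example above.

```python
import re


def parse_tabula_script(line: str):
    """Return (area, page) from a Tabula-exported command line."""
    area_match = re.search(r"-a\s+([\d.]+),([\d.]+),([\d.]+),([\d.]+)", line)
    if area_match is None:
        raise ValueError("no -a area option found")
    page_match = re.search(r"-p\s+(\S+)", line)
    # Tabula's area order is top, left, bottom, right
    area = tuple(float(value) for value in area_match.groups())
    page = page_match.group(1) if page_match else None
    return area, page


def read_table_in_area(pdf_path: str, area, page):
    """Read the table inside the given area of the given page."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    return tabula.read_pdf(pdf_path, pages=page, area=list(area),
                           lattice=True, encoding="ISO-8859-1")


line = 'java -jar tabula-java.jar -a 164.093,47.813,782.978,564.953 -p 4 "$1"'
area, page = parse_tabula_script(line)
print(area, page)
```

The parsed values can then be passed straight to read_table_in_area together with the path to your PDF.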