Basic - OSINT - M05 - Gathering data by Web Scraping

This post is a part of the OSINT series.

1. Why use Web Scraping as an OSINT technique.

In the previous articles of this OSINT series we covered popular social media services. By now we know that we can gather a lot of interesting information from these services and use it in our research.

Today we are going to look at other websites and services that either do not provide an API to their resources, or whose API is not sufficient for our needs.

2. Getting data from services without API using import.io.

Import.io is a great service for fast website scraping; it provides a useful desktop app that requires no coding skills.

Using this service is also very easy: just go to https://magic.import.io/ and, in the search bar, paste or type the address of the webpage you want to get the data from. Import.io also supports pagination, so even when the webpage you would like to scrape or crawl is divided into multiple pages, import.io will cope with it.

The result of working with import.io is a *.csv file, a format every researcher is familiar with; after downloading this file we can work with the data in the most convenient way.
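Once the export is downloaded, loading it back into Python takes only the standard library. The sketch below uses a made-up inline sample (the column names and filename are assumptions, not import.io's actual export format) to show the technique:

```python
import csv
import io

# Sample of the kind of CSV an export might contain; in practice you would
# open the downloaded file instead. Columns here are illustrative only.
sample = """name,price,url
"Widget A","19.99","http://example.com/a"
"Widget B","24.50","http://example.com/b"
"""

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

for row in rows:
    print(row["name"], row["price"])
```

For a real export you would replace the `io.StringIO(sample)` wrapper with `open("export.csv", newline="")`.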

I encourage you all to visit the website of this project and give it a try; I believe you will soon find it useful.

3. Turning data from pages into API using KimonoLabs.

KimonoLabs.com is in some ways similar to the service mentioned above: it also exposes data from websites through an API, so in the simplest terms we can access this data (from a specific website) from other languages and devices. What is also very interesting, this API and its content can be refreshed automatically at predefined time intervals, so data gathered this way will always be up to date.
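Consuming such an API from Python is just a matter of fetching and parsing JSON. The response shape, API URL, and key below are assumptions for illustration, not KimonoLabs' documented format:

```python
import json
from urllib.request import urlopen  # used for the live call, shown at the end

# A scraping-API service typically returns its results as JSON; the exact
# structure below is a made-up example of that pattern.
sample_response = json.dumps({
    "results": {
        "collection1": [
            {"title": "Post one", "link": "http://example.com/1"},
            {"title": "Post two", "link": "http://example.com/2"},
        ]
    }
})

data = json.loads(sample_response)
items = data["results"]["collection1"]
for item in items:
    print(item["title"], "->", item["link"])

# A live request would look roughly like this (URL and key are placeholders):
# with urlopen("https://www.kimonolabs.com/api/YOUR_API_ID?apikey=KEY") as r:
#     data = json.loads(r.read().decode())
```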

Again, I encourage you to visit the website of this project and give it a try.

4. Simple web scraper using Python.

4.1 What and how are we going to scrape?

In this section I will show you how to write the simplest web scraper in Python. We are going to scrape http://www.ceneo.pl/, one of the most popular price comparison services in Poland. This site has no API, so when we want to get some data from it, we have to use web scraping.

Before doing any web scraping we have to find out the structure of the webpage. To do this we can simply use the Developer Tools in the Chrome browser and inspect the webpage in the Elements tab. Of course, to do this we have to be familiar with HTML and know what we are looking for. I will not walk through this step by step, because there are plenty of resources about it on the Internet, but I am pretty sure the Python source code is clear enough to explain how to scrape a webpage, and the provided source code also contains a lot of comments.
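The core extraction step can be sketched with nothing but the standard library. Ceneo's real markup differs and changes over time, so the HTML snippet and class names below are assumptions; the technique (matching elements by class and collecting their text) is what the real scraper does:

```python
from html.parser import HTMLParser

# Illustrative HTML only -- the class names are assumptions, not Ceneo's
# actual markup, which you would discover with Chrome Developer Tools.
SAMPLE_HTML = """
<div class="cat-prod-row">
  <span class="price">199,99 zl</span>
  <strong class="cat-prod-row-name">Example Phone X</strong>
</div>
<div class="cat-prod-row">
  <span class="price">249,00 zl</span>
  <strong class="cat-prod-row-name">Example Phone Y</strong>
</div>
"""

class ProductParser(HTMLParser):
    """Collects name/price pairs from product rows."""

    def __init__(self):
        super().__init__()
        self._field = None      # which field the next text belongs to
        self._current = {}      # partially built product record
        self.products = []      # finished {"name": ..., "price": ...} dicts

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "price" in classes:
            self._field = "price"
        elif "cat-prod-row-name" in classes:
            self._field = "name"

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "name" in self._current and "price" in self._current:
                self.products.append(self._current)
                self._current = {}

parser = ProductParser()
parser.feed(SAMPLE_HTML)
for p in parser.products:
    print(p["name"], "-", p["price"])
```

In a real run you would download the page first (for example with `urllib.request.urlopen`) and feed the response body to the parser instead of the sample string.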

4.2 Source code.

How to run this code:

python pl_ceneo_scrap.py -p PRODUCT+NAME

Parameters:

-p - product name

Output:

PRODUCT+NAME_pl_ceneo_data.csv

File Header:

"item_name","item_price","item_category","req_ip","req_country","req_time","req_price_comp"

5. Find out fanpage likes.

For various reasons it is sometimes very important to know who likes a specific fanpage; we can use this knowledge, for example, to propose a similar page to those people, or to find out who they are.

Unfortunately, there is no endpoint for this operation in the Facebook API, and the most common way to do it is a little tricky.

1. First, find the fanpage ID.

The fastest way to do this is to go to findmyfbid.com, enter the fanpage address, and click Find numeric ID.

2. Second, prepare the correct web address.

We have to copy the ID from the result page and prepare an address like this:

https://www.facebook.com/search/ID/likers

where ID is the fanpage numeric ID (for example: 1029906563686976).

Then we can copy this address, paste it into the web browser, and we should see all fans of the fanpage. Of course, we can now use web scraping to extract this data into a convenient *.csv file for further research.
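The two steps above reduce to a simple string substitution; a minimal sketch, using the example ID given earlier:

```python
# Build the Facebook search address from a numeric page ID,
# following the URL pattern described above.
def likers_url(page_id):
    """Return the Facebook search URL listing fans of the given page."""
    return "https://www.facebook.com/search/{}/likers".format(page_id)

url = likers_url("1029906563686976")
print(url)
```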

Conclusion

Web scraping can be a very powerful method in OSINT, and every researcher should keep this technique in mind.

Please remember that this is just an introduction and covers only data gathering; it is only the first step into the OSINT world. Next we need to know how to analyze the collected data and how to extract information from it. I would like to describe these steps in an Advanced OSINT series, so please leave a comment below if you are interested in such a series of articles.

Posted with : OSINT

If you liked this post, you can share it with your followers or follow me on Twitter!