M05 - Recommenders

A quick reminder of what we mean by recommenders:

Recommenders - Recommendation models are a commonly used type of machine learning solution that matches users to items. While you can use regression, classification, and clustering models to build recommenders, a more common approach is a filter-based recommender that uses matrix factorization. This is a technique in which the known ratings that users have given to items are used to estimate the ratings that are missing from the matrix.
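The matrix-factorization idea can be sketched in a few lines of NumPy. This is only an illustrative sketch on made-up toy ratings, not the Azure ML implementation: we learn low-rank user and item factors from the known entries, and their product fills in the missing ones.

```python
import numpy as np

# Toy user-item rating matrix; 0 marks an unknown rating (hypothetical data).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def factorize(R, k=2, steps=5000, lr=0.01, reg=0.02):
    """Learn user factors P and item factors Q so that P @ Q.T approximates
    the known (non-zero) entries of R, via stochastic gradient descent."""
    rng = np.random.default_rng(0)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    users, items = np.nonzero(R)          # indices of the known ratings
    for _ in range(steps):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

P, Q = factorize(R)
predictions = P @ Q.T   # filled-in matrix, including previously unknown cells
```

The cells of `predictions` that were zero in `R` are the "likely ratings that are not present in the matrix" mentioned above.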

1. Algorithms for recommendation

  • Filter-based recommender

  • Classifiers

  • Clustering

  • Forecasting

2. Recommender process

  • Define recommendation objective

  • Determine which method to use

Depending on the objective, we may need matrix factorization, item recommendation, finding users or items related to each other, or rating prediction.

  • Determine type of recommendation

  • Create, evaluate and improve model

3. Metrics for recommendations

  • Manhattan or Euclidean distances
  • Normalized Discounted Cumulative Gain (NDCG) for item recommendation, related items, and related users
  • Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) for rating prediction
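To make the metrics concrete, here is a small sketch computing MAE, RMSE, and NDCG on hypothetical ratings and an illustrative relevance list (the numbers are made up):

```python
import numpy as np

# Hypothetical true and predicted ratings for five items.
y_true = np.array([3.0, 5.0, 2.0, 4.0, 1.0])
y_pred = np.array([2.5, 4.8, 2.2, 3.5, 1.4])

mae = np.mean(np.abs(y_true - y_pred))               # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))      # root mean squared error

def ndcg(relevances, k=None):
    """NDCG for one recommendation list: the DCG of the list in the shown
    order, normalized by the DCG of the ideally ordered list."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum(rel * discounts)
    ideal = np.sum(np.sort(rel)[::-1] * discounts)
    return dcg / ideal if ideal > 0 else 0.0

# Relevance of recommended items in the order they were shown.
score = ndcg([3, 2, 3, 0, 1])
```

A perfectly ordered list gets NDCG of 1.0; any misordering pushes the score below 1.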

Azure ML modules:

Basic - OSINT - M01 - Gathering data from Twitter

This post is a part of the OSINT series.

1. Why would you want to use Twitter in OSINT?

These days Twitter is the most commonly used micro-blogging platform in the world. Below I would like to show you some of the most important Twitter numbers:

  • Active users: 302 million

  • Tweets sent per day: 500 million

  • Minutes per month on Twitter: 170

  • Accounts outside the U.S.: 77%

  • Photos in tweets: 55%

  • Links in tweets: 31%

Worldwide distribution of users:

  • Asia Pacific - 33%

  • North America - 24%

  • Western Europe - 17%

  • Latin America - 12%

  • Central & Eastern Europe - 7%

  • Middle East & Africa - 7%

Twitter also provides an API which can be very useful for data mining and for gathering interesting information about Twitter users. In this article we are going to use the Tweepy library, which will help us use the API from Python.

2. How to use Twitter in OSINT?

2.1 Get user tweets and save them.

How to run this code:

python all_tweets_from_user.py -u tomkowalczyk

Parameters:

-u - Twitter user name

Output:

tomkowalczyk_tweets.csv

File Header:

"id","created_at","text"

Comment:

With this kind of data about a user, we can then find out more using other techniques. For example:

  • what is this user interested in?
  • what does he tweet about?
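The attached script is not reproduced here, but a minimal Tweepy sketch of what such a script might look like is below. The function names, CSV helper, and credential parameters are my own illustration, not the attached source:

```python
import csv

def save_tweets(tweets, path):
    """Write (id, created_at, text) rows matching the CSV header above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["id", "created_at", "text"])
        for t in tweets:
            writer.writerow([t.id, t.created_at, t.text])

def fetch_user_tweets(screen_name, consumer_key, consumer_secret,
                      access_token, access_secret):
    """Page through a user's timeline with Tweepy's Cursor."""
    import tweepy  # imported lazily so save_tweets works without tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return list(tweepy.Cursor(api.user_timeline,
                              screen_name=screen_name).items())
```

`tweepy.Cursor` handles the pagination for you; `wait_on_rate_limit=True` makes the client sleep instead of failing when Twitter's rate limits kick in.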

2.2 Discover friends for Twitter user.

How to run this code:

python user_friends.py -u tomkowalczyk

Parameters:

-u - Twitter user name

Output:

friends_of_tomkowalczyk.csv

File Header:

"friend_id","name"

Comment:

With this kind of data about a user, we can then find out more about him. For example:

  • who are the user's friends?
  • what kind of people are they?
  • do they have any other mutual connections?
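A hedged sketch of how such a script could collect friends with Tweepy (the helper and its names are my own illustration; the followers script in the next section has the same shape, using `api.followers` instead of `api.friends`):

```python
import csv

def save_users(users, path, id_field="friend_id"):
    """Write (id, name) rows matching the CSV header above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow([id_field, "name"])
        for u in users:
            writer.writerow([u.id, u.name])

def fetch_friends(api, screen_name):
    """Page through the accounts a user follows (their 'friends').

    `api` is an authenticated tweepy.API instance."""
    import tweepy  # imported lazily so save_users works without tweepy
    return list(tweepy.Cursor(api.friends, screen_name=screen_name).items())
```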

2.3 Discover followers for Twitter user.

How to run this code:

python user_followers.py -u tomkowalczyk

Parameters:

-u - Twitter user name

Output:

followers_of_tomkowalczyk.csv

File Header:

"follower_id","name"

Comment:

With this kind of data about a user, we can then find out more about him. For example:

  • who follows this user?
  • what kind of people are they?
  • do they have any other mutual connections?

2.4 Determine what people are tweeting about right now.

How to run this code:

python location_trends.py -w 523920

Parameters:

-w - WOEID code of location

Output:

Output printed in console

Comment:

With this kind of data, we can then find out more about a specific region. For example:

  • what is happening there?
  • what are people there interested in?
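A sketch of how the trends lookup might work with Tweepy. The response of Twitter's trends/place endpoint is a one-element list whose `trends` key holds the individual trends; the helper below just extracts the names (my own illustration, not the attached source):

```python
def trend_names(trends_response):
    """Extract trend names from the trends/place response structure:
    a list with one element whose 'trends' key holds the trends."""
    return [t["name"] for t in trends_response[0]["trends"]]

def fetch_trends(api, woeid):
    """Fetch trending topics for a WOEID.

    `api` is an authenticated tweepy.API instance (untested sketch)."""
    return trend_names(api.trends_place(woeid))
```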

3. Next steps

The aim of this article was to present the possibilities of using Twitter as a source of data: I described why Twitter can be so important in OSINT and how to gather data from it.

Please remember that this is just an introduction and covers only data gathering; it is only the first step into the OSINT world. The next steps are knowing how to analyze the collected data and how to extract information from it. I would like to describe these steps in an Advanced OSINT series, so please leave a comment below if you are interested in such a series of articles.

4. Additional information

The source code attached to this article is available on GitHub.

To start working with the Twitter API you will need to register as a developer to obtain developer keys. You can of course find out how on Google, or simply read the instructions on the Twitter Developer Site and go to Twitter Apps.

For the fourth code sample you will also need the WOEID of the location you would like to follow; you can get it from here.

M04 - Regression, Classification, and Unsupervised Learning

1. Regression and Classification process modeling

A quick reminder of what we mean by regression and classification:

Both are Supervised Learning models.

Regression - Regression models predict a numeric value for a label based on a function that applies coefficients to a set of known feature values. Regression is a form of supervised learning, so the function and coefficients are determined by training a regression algorithm with a training dataset, and evaluating it against a testing data set, in which the label values are known.

Classification - Classification models predict a categorical label value based on a function that applies coefficients to a set of feature values. The simplest classification models predict True or False (1 or 0), but you can also create multi-class classification models that are used to classify entities into a set of defined classes.
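To make the "function that applies coefficients to known feature values" concrete, here is a minimal least-squares regression fit on made-up data (an illustrative sketch, not an Azure ML module):

```python
import numpy as np

# Toy regression: recover known coefficients from noisy synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 3.0 + rng.normal(scale=0.1, size=200)

# Fit by least squares: add a bias column and solve for the coefficients.
Xb = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
intercept, weights = coef[0], coef[1:]
```

Training "determines the function and coefficients": with enough data, `weights` recovers `true_coef` and `intercept` recovers the bias of 3.0.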

What the process looks like:

1.1 We have to understand the relationships in the data: what kind of data it is and where it comes from

1.2 In the second step we select only those features that are important for our study

1.3 After this we should select the metric that best suits our needs

1.4 Modeling our experiment:

  • Create model

  • Evaluate model

  • Improve model

  • Cross Validate model

Improving the model's results usually requires the actions below:

  • understanding residuals

  • filter, transform the data

  • feature engineering

  • better feature selection

  • use different type of model

  • choice of model parameters

Cross Validation steps:

  • divide the data into 10 approximately equally sized parts

  • train the algorithm on 9 parts and compute the evaluation measure on the remaining part

  • repeat this 10 times, using each part in turn as the test part

  • report the mean and standard deviation of the evaluation measure over the 10 parts
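The cross-validation steps above can be sketched generically; the mean-predictor "model" and the toy data below are my own illustration of the loop, not a real experiment:

```python
import numpy as np

def cross_validate(X, y, fit, score, k=10, seed=0):
    """k-fold cross-validation: split the data into k parts, train on k-1,
    evaluate on the held-out part, and repeat for each part in turn."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[test], y[test]))
    # report mean and standard deviation over the k test parts
    return np.mean(scores), np.std(scores)

# Trivial example: the "model" is just the mean of the training labels,
# and the evaluation measure is MAE on the held-out part.
X = np.arange(100).reshape(-1, 1).astype(float)
y = X.ravel() * 0.5
fit = lambda X, y: y.mean()
score = lambda m, X, y: np.mean(np.abs(y - m))
mean_mae, std_mae = cross_validate(X, y, fit, score)
```

Swapping in a real `fit`/`score` pair (e.g. a regression trainer and RMSE) turns this into the procedure described in the bullets.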

Azure ML modules:

2. Unsupervised Learning models

A quick reminder of what we mean by unsupervised learning models:

Unsupervised Learning models - Unsupervised learning models are based on a function that categorizes entities by applying coefficients to numeric feature values. The main difference between supervised learning (like regression and classification) and unsupervised learning (like clustering) is that in unsupervised learning, there are no known label values with which to train the model. The model simply groups entities together into a specified number of clusters based on similarities, which are usually determined by calculating the mathematical distance between the entities.

Before we start the process, it is good to know:

2.1 What is the business problem?

2.2 There is no ground truth because there are no labels

2.3 Evaluation is a huge challenge (mainly visualizations help here)

2.4 What does the structure of the data tell us?

2.5 Do different models yield different results?

2.6 How many clusters do we expect, and how many of them will be useful?

Most popular algorithms:

  • K-means
  • Hierarchical Agglomerative clustering
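K-means can be written in a few lines of NumPy; this is a plain illustrative implementation on two made-up, well-separated blobs, not a library call:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-means: alternately assign points to the nearest centroid
    and move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distances of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated blobs of synthetic points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

The "mathematical distance between the entities" from the definition above is exactly the Euclidean distance computed in the assignment step.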

How to evaluate:

  • Are the clusters well separated?

Maybe they lie on top of each other, or they are just big random ellipses.

  • Does the structure tell us anything?

When we look at plots or projections, do they tell us anything?

Azure ML modules: