M02 - Working with Data

Data Science is all about data, so in this module we focus on how data gets into Azure ML, how it flows through an experiment, and how it is cleaned and transformed.

Azure ML Data I/O options

Azure ML Data source modules

They allow us to interact with external applications. For example, we can take data from, and publish results to, Excel or Microsoft Power BI, or any other application that understands web service protocols.

Reader and Writer modules, which allow us to move large amounts of data between the storage layers of the Cortana suite and Azure ML.

Reading image data into Azure ML.

Data Flow

Experiment - the data flow of the Data Science process in Azure ML.

Batch processing - we bring in a big chunk of data, munge it for some time, and output the results. The run time can range from minutes to perhaps weeks if we are grinding our way through a very large dataset.

training models and processing large chunks

Real-Time processing - we are trying to be interactive or semi-interactive with our users. For example, a business analyst submits a new set of input data through a web service and wants to see the result in a spreadsheet or a BI tool.

web services, HTTPS connection

Terms to know: joins
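To make the term concrete, here is a minimal sketch of two common joins using pandas rather than an Azure ML module. The `customers` and `orders` tables and their columns are hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cid"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "amount": [10.0, 25.0, 7.5, 99.0],
})

# Inner join: keep only rows whose key appears in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join: keep every customer; customers without orders get NaN amounts.
left = customers.merge(orders, on="customer_id", how="left")

print(inner)
print(left)
```

The inner join drops customer 2 (no orders) and order key 4 (no matching customer), while the left join keeps customer 2 with a missing amount.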

R and/or Python

R data frames, pandas data frames in Python, and Azure ML's tables are all rectangular tables, which means that each column must have a single type. Different columns can have different types, but all values within one column must share the same type.
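The one-type-per-column rule can be checked directly in pandas. This is a small sketch with made-up column names and values, not data from the course.

```python
import pandas as pd

# Made-up example data: three columns, each with its own type.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],   # text
    "temp_c": [4.5, 19.0, 31.2],        # floating point
    "visits": [3, 12, 7],               # integer
})

# Each column reports exactly one dtype.
print(df.dtypes)

# If values of different kinds are mixed in one column, pandas does not
# keep a numeric type; it falls back to the generic object dtype.
mixed = pd.Series([1, 2, "three"])
print(mixed.dtype)
```

A column that unexpectedly shows up as `object` is often a hint that stray text values slipped into numeric data.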

Data Sampling and Quantization

Azure ML Data Types:

Terms to know: quantization


Quantizing - converting continuous variables to categorical (small, medium, large; hot, cold)
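As a rough illustration outside Azure ML, quantizing can be sketched with `pandas.cut`. The values and bin edges below are arbitrary assumptions chosen only to show the idea.

```python
import pandas as pd

# Arbitrary continuous values, invented for illustration.
sizes = pd.Series([3, 12, 25, 48, 90], name="size")

# Assumed bin edges: (0, 10] -> small, (10, 50] -> medium, (50, 100] -> large.
size_class = pd.cut(
    sizes,
    bins=[0, 10, 50, 100],
    labels=["small", "medium", "large"],
)

print(pd.DataFrame({"size": sizes, "size_class": size_class}))
```

The result is a categorical column, so the model sees three levels instead of a continuous number. Where to put the bin edges is a modelling decision and depends on the data.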

Azure ML modules:

Data Cleaning and Transformation

Cleaning, filtering, and transforming data are among the most important aspects of any Data Science project. They include:

  1. Identifying and handling missing or duplicate values.

Treating missing values:

Tools for missing values:

Tools for repeated values:
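Independently of the Azure ML modules, the same steps can be sketched in pandas: count missing values, remove repeated rows, then either drop or fill what is missing. The data frame is made up, and filling with the median is just one possible treatment, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one repeated row (id 2) and one missing price (id 3).
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "price": [10.0, 20.0, 20.0, np.nan, 50.0],
})

# Identify: count missing values per column.
missing_counts = df.isna().sum()

# Repeated values: drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Treatment option 1: remove rows with a missing price.
dropped = deduped.dropna(subset=["price"])

# Treatment option 2: fill the missing price with the column median.
filled = deduped.fillna({"price": deduped["price"].median()})

print(missing_counts)
print(dropped)
print(filled)
```

Dropping rows is the simplest option but loses data, while filling keeps every row at the cost of inventing a value.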

  2. Identifying and handling outliers and errors.

We have many possible sources of errors:

Tools for outliers:

Sometimes it is good to use visualization which helps validate outliers.
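One common way to flag candidate outliers before inspecting them on a plot is the 1.5 x IQR rule. This is a generic sketch in pandas with invented values, not the specific method used by any Azure ML module.

```python
import pandas as pd

# Invented measurements with one suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Interquartile range: spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as a candidate.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])
```

A flagged value is only a candidate. Whether it is an error or a genuine extreme observation still needs to be judged, which is where a plot helps.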

  3. Scaling numeric values to make them easier to compare.

Reasons for scaling:

Tools for scaling:

Scaling, normalization, and standardization are used interchangeably here; they all mean the same thing.
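Two widely used scaling formulas are min-max scaling and z-score scaling. The sketch below applies both in pandas to a made-up table. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up columns on very different scales.
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [20000, 40000, 60000, 100000],
})

# Min-max scaling: map each column onto the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score scaling: zero mean and unit (population) standard deviation.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

After either transformation, `age` and `income` are on comparable scales, so a large raw range in one column no longer dominates the other.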

