BigSnarf blog

Infosec FTW

Building your first neural network self driving car in Python

 

1. Get RC Car

2. Learn to drive it

3. Take the car apart to examine the onboard controller and the wireless remote

4. Use a soldering iron and multimeter to identify the positive and negative leads and which circuits fire for each control

Testing – Link Mac to Arduino to Wireless Controller

5. Get an Arduino board and a USB cable

6. Install software and load Arduino program onto board

7. Install pygame and pyserial

8. Run python carDriving.py to test the soldered connections and drive the car from the keyboard (a minimal sketch of this kind of script follows)
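
As a rough idea of what a keyboard-driving script can look like, here is a minimal sketch using pygame and pyserial. The serial port name and the single-byte command protocol are assumptions for illustration, not the actual carDriving.py:

# Minimal keyboard-driving sketch (hypothetical port name and command bytes).
import pygame
import serial

# Adjust the port to match your Arduino (e.g. /dev/ttyACM0 on Linux) -- assumption.
ser = serial.Serial('/dev/tty.usbmodem1411', 9600)

COMMANDS = {                      # single-byte commands the Arduino sketch would interpret -- assumption
    pygame.K_UP: b'F',
    pygame.K_DOWN: b'B',
    pygame.K_LEFT: b'L',
    pygame.K_RIGHT: b'R',
}

pygame.init()
pygame.display.set_mode((200, 200))   # a window is needed to receive key events

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN and event.key in COMMANDS:
            ser.write(COMMANDS[event.key])
        elif event.type == pygame.KEYUP and event.key in COMMANDS:
            ser.write(b'S')            # stop when the key is released -- assumption

ser.close()
pygame.quit()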


 

Testing – Capturing image data for training dataset


On the first iteration of the physical devices, I mounted the disassembled Logitech C270/Raspberry Pi on the car with a coat hanger that I chopped up and modified to hold the camera. I pointed it down so it could see the hood and some of the “road”. The webcam captures video frames of the road ahead at ~24 fps.

I send the captured stream across the Wi-Fi network back to my MacBook Pro using a small Python server built on basic sockets.

On the MacBook Pro, I run a Python client that connects to the Raspberry Pi over basic sockets. I take the 320×240 color stream, then downsample and grayscale the video frames for preprocessing into a NumPy matrix.

The video is streamed wirelessly and captured with OpenCV, sliced into JPEG frames, preprocessed and reshaped into NumPy arrays, and paired with the key-press data as labels.
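
A minimal sketch of that preprocessing step might look like the following; the frame size, label encoding, and function names are assumptions for illustration:

# Decode a JPEG frame, grayscale it, downsample it, and flatten it into a feature vector.
import cv2
import numpy as np

def preprocess(jpeg_bytes):
    frame = cv2.imdecode(np.frombuffer(jpeg_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (240, 240))          # 240x240 -> 57,600 inputs, per the post
    return small.astype(np.float32).flatten() / 255.0

LABELS = {'forward': 0, 'none': 1, 'left': 2, 'right': 3}   # label encoding -- assumption

features, labels = [], []
# inside the capture loop you would do something like:
#   features.append(preprocess(jpeg_bytes))
#   labels.append(LABELS[current_key])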

Testing – First Build of Car with components


Testing – Convert 240×240 into greyscale

57,600 input neurons (240 × 240 pixels)

Take 2 : Using PiCamera and stream images to Laptop

Take 2 -Load new Arduino Sketch and change PINS

Take 2 – Stream Data from Pi to Laptop

Train Neural Network with train.pkl

I converted the NumPy data to a pickle file and then used it to train a simple 3-layer neural network in Python, with 65,536 neurons in the input layer, 1,000 neurons in the hidden layer, and 4 output neurons: Forward, None, Left, and Right.
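
A rough sketch of that training step, assuming train.pkl holds a (features, labels) pair and using a Network class like the one built in the from-scratch section later in this post (the hyperparameters here are illustrative):

# Load the pickled training data and train the 3-layer network (65,536 -> 1,000 -> 4).
import pickle
import numpy as np

with open('train.pkl', 'rb') as f:
    features, labels = pickle.load(f)      # shapes assumed: (n, 65536) and (n, 4)

training_data = [(x.reshape(-1, 1), y.reshape(-1, 1)) for x, y in zip(features, labels)]

net = Network([65536, 1000, 4])            # Forward, None, Left, Right outputs
net.SGD(training_data, epochs=10, mini_batch_size=32, eta=0.01)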

 

Check predictions of Neural Network

 

Test driving car via key press

Test driving car via prediction

 

Test trained Neural Network with live camera data…enjoy!

 

Links

Next Steps

  • Deep Learning
  • Computer Vision
  • Vehicle Dynamics
  • Controllers
  • Localization
  • Mapping (SLAM)
  • Sensors & Fusion
  • Safety Systems and Ethics

Report-style documentation of the custom RC car build

LIDAR and Deep Learning

LiDAR sensors and software enable real-time capture and processing of 3D mapping data, along with object detection, tracking, and classification. They can be used in self-driving cars, perimeter security systems, and interior security systems.

http://images.nvidia.com/content/tegra/automotive/images/2016/solutions/pdf/end-to-end-dl-using-px.pdf

http://conference.scipy.org/proceedings/scipy2012/pdfs/iqbal_mohomed.pdf

http://juxi.net/workshop/deep-learning-rss-2016/papers/Nicolai%20-%20Deep%20Learning%20Lidar%20Odometry.pdf

https://github.com/dps/nnrccar

https://gopigo.firebaseapp.com/

http://www.danielgm.net/cc/

https://github.com/bigsnarfdude/loam_velodyne

http://www.phoenix-aerial.com/information/lidar-comparison/

http://www.gim-international.com/content/news/9-revolutionary-lidar-survey-projects

http://velodynelidar.com/vlp-16-lite.html

https://www.idaholidar.org/free-lidar-tools/

http://www.technavio.com/blog/top-companies-global-automotive-lidar-sensors-market

https://zhengludwig.wordpress.com/projects/self-driving-rc-car/

Neural Network Driving in GTAV

http://deepdrive.io/


https://github.com/samjabrahams/tensorflow-on-raspberry-pi

Drive a Lamborghini With Your Keyboard

http://www.acmesystems.it/timelaps_video

 

Convolutional Neural Network in one picture


Deep Learning Malware and Network Flows

Using Inception v3 Tensorflow for MNIST

Modern object recognition models have millions of parameters and can take weeks to fully train. Transfer learning is a technique that shortcuts a lot of this work by taking a fully-trained model for a set of categories like ImageNet, and retrains from the existing weights for new classes. In this example we’ll be retraining the final layer from scratch, while leaving all the others untouched. For more information on the approach you can see this paper on Decaf.

Though it’s not as good as a full training run, this is surprisingly effective for many applications, and can be run in as little as 75 minutes on a laptop, without requiring a GPU. The data I used is from the Kaggle MNIST dataset.

Let’s reshape the train.csv data from Kaggle into JPEG images with the script below


Script to convert train.csv to images in python
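
A minimal version of such a converter might look like this; the output folder and the <label>_<index>.jpg filename scheme are assumptions:

# Convert Kaggle's train.csv (label column + 784 pixel columns) into JPEG files.
import os
import numpy as np
import pandas as pd
from PIL import Image

df = pd.read_csv('train.csv')
os.makedirs('mnist_jpegs', exist_ok=True)

for i, row in df.iterrows():
    label = int(row['label'])
    pixels = row.drop('label').values.astype(np.uint8).reshape(28, 28)
    Image.fromarray(pixels, mode='L').save('mnist_jpegs/{}_{}.jpg'.format(label, i))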

 

Let’s move the data to the proper folders
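
For example, a small sketch that sorts the JPEGs into one folder per digit, the layout the Inception retraining script expects; the folder names follow the converter sketch above and are assumptions:

# Move each <label>_<index>.jpg into mnist_training/<label>/.
import glob
import os
import shutil

for path in glob.glob('mnist_jpegs/*.jpg'):
    label = os.path.basename(path).split('_')[0]      # filename format: <label>_<index>.jpg
    dest_dir = os.path.join('mnist_training', label)
    os.makedirs(dest_dir, exist_ok=True)
    shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))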

 

These are screenshots of the re-trained Inception v3 model


 

Re-training the model

Using the re-trained model to do MNIST prediction
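
A hedged sketch of running a prediction with the retrained graph, assuming the default file names and tensor names used by TensorFlow’s retraining example (output_graph.pb, output_labels.txt, DecodeJpeg/contents:0, final_result:0); the image path is a placeholder:

# Classify one digit image with the retrained Inception v3 graph (TF 1.x-era API).
import tensorflow as tf

image_data = tf.gfile.FastGFile('some_digit.jpg', 'rb').read()
labels = [line.strip() for line in tf.gfile.GFile('output_labels.txt')]

with tf.gfile.FastGFile('output_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session() as sess:
    softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
    predictions = sess.run(softmax_tensor, {'DecodeJpeg/contents:0': image_data})[0]
    for node_id in predictions.argsort()[::-1][:3]:    # top three guesses
        print(labels[node_id], predictions[node_id])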

Links:

Neural Network from scratch in Python

So you want to teach a computer to recognize handwritten digits? You want to code this out in Python? You understand a little about Machine Learning? You wanna build a neural network?

Let’s try to implement a simple 3-layer neural network (NN) from scratch. I won’t get into the math because I suck at math, let alone at trying to teach it. I can point you to moar math resources if you want to read up on the details.

I assume you’re familiar with basic Machine Learning concepts like classification and regularization. Oh, and how optimization techniques like gradient descent work.

So, why not teach you Tensorflow or some other deep learning framework? I found that I learn best when I see the code and learn the basics of the implementation. It helps me build intuition for choosing each part of the model. Of course, there are AutoML solutions that could get me to a baseline more quickly, but I still wouldn’t know anything. I’m trying to get out of just running the code like a script kiddie.

So let’s get started!

For the past few months (thanks Arvin), I have learned to appreciate both classic Machine Learning (prior to 2012) and Deep Learning techniques to model Kaggle competition data.

The handwritten digits competition was my first attempt at deep learning, so I think it’s appropriate that it’s your first deep learning example too. I remember an important gotcha moment: seeing the relationship between the raw data and the pictures it encodes. That helped me imagine the deep learning concepts visually.

What does the data look like?

We’re going to use the classic visual recognition challenge data set, called the MNIST data set. Kaggle competitions are awesome because you can self-score your solutions, and they provide data in simple, clean CSV files. If successful, we should have a deep learning solution that is able to classify 25,000 images with the correct labels. Let’s look at the CSV data.

Using a Jupyter notebook, let’s dump the data into a numpy matrix, and reshape it back into a picture. Each digit has been normalized to a 28 by 28 matrix.
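
For example, a quick notebook sketch (the column names follow the Kaggle Digit Recognizer CSV layout):

# Load train.csv, reshape one row back into a 28x28 picture, and display it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')              # first column 'label', then 784 pixel columns
labels = df['label'].values
images = df.drop('label', axis=1).values   # shape (n_samples, 784)

digit = images[0].reshape(28, 28)
plt.imshow(digit, cmap='gray')
plt.title('label: {}'.format(labels[0]))
plt.show()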

 

The goal is to take the training data as an input (handwritten digit), pump it through the deep learning model, and predict if the data is a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.

 

Architecture of a Simple Neural Network

1. Picking the shape of the neural network. I’m gonna choose a simple NN consisting of three layers:

  • First Layer: Input layer (784 neurons)
  • Second Layer: Hidden layer (n = 15 neurons)
  • Third Layer: Output layer (10 neurons, one per digit)

Here’s a look at the 3-layer network proposed above:

Basic Structure of the code

Data structure to hold our data

2. Picking the right matrix data structure. Nested Python lists? CUDAMat? Python dicts? I’m choosing NumPy because we’ll heavily use the np.dot, np.reshape, np.random, np.zeros, np.argmax, and np.exp functions, which I’m not really interested in implementing from scratch.
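
As a sketch of what the constructor could look like, in the style of Michael Nielsen’s network.py (the class name and layer sizes are illustrative):

import numpy as np

class Network(object):
    def __init__(self, sizes):
        # sizes, e.g. [784, 15, 10], gives the number of neurons in each layer
        self.num_layers = len(sizes)
        self.sizes = sizes
        # one bias vector per non-input layer, one weight matrix per pair of adjacent layers
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]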

Simulating perceptrons using an Activation Function

3.  Picking the activation function for our hidden layer. The activation function transforms the inputs of the hidden layer into its outputs. Common choices for activation functions are tanh, the sigmoid function, or ReLUs. We’ll use the sigmoid function.
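
A minimal sketch of the sigmoid and its derivative (the derivative is needed later for backpropagation):

def sigmoid(z):
    # squash z into (0, 1); works elementwise on NumPy arrays
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid
    return sigmoid(z) * (1.0 - sigmoid(z))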

Python Neural Network Object

Feed Forward Function

a.k.a The Forward Pass

The purpose of the feedforward function is to pass the input forward through the network’s weights and biases, layer by layer, and return the output activations.
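
A sketch of such a feedforward method, assuming the weights and biases initialized in the constructor above:

def feedforward(self, a):
    # method of the Network class; a is a column vector of input activations
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a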

Stochastic Gradient Descent function (SGD)

The SGD function drives training: for each epoch it shuffles the training data, splits it into mini-batches, and updates the weights and biases from each mini-batch in turn.
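
A sketch of what the SGD method could look like (the evaluate helper used for reporting accuracy is assumed to exist elsewhere in the class):

import random

def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
    # method of the Network class; training_data is a list of (x, y) tuples, eta is the learning rate
    n = len(training_data)
    for j in range(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k:k + mini_batch_size]
                        for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print("Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), len(test_data)))  # evaluate() assumed elsewhere
        else:
            print("Epoch {0} complete".format(j))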

Update Mini Batch Function

Mini-batch gradient descent can work a bit faster than stochastic gradient descent. In batch gradient descent we use all m examples in each iteration, whereas in stochastic gradient descent we use a single example in each iteration. Mini-batch gradient descent sits somewhere in between: we use b examples in each iteration, where b is a parameter called the mini-batch size.
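
A sketch of an update_mini_batch method that averages the per-example gradients returned by backprop and takes one gradient descent step:

def update_mini_batch(self, mini_batch, eta):
    # method of the Network class; applies one gradient descent step using the b examples in mini_batch
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)   # gradients for one example
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w - (eta / len(mini_batch)) * nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta / len(mini_batch)) * nb
                   for b, nb in zip(self.biases, nabla_b)]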

Back Prop Function

a.k.a The Backwards Pass

Our goal with back propagation is to update each of the weights in the network so that the actual output moves closer to the target output, minimizing the error for each output neuron and for the network as a whole. Backprop does this by computing the gradient of the cost function with respect to every weight and bias, which gradient descent then uses for the updates.
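
A sketch of the backprop method for this fully connected sigmoid network, returning the per-layer gradients that update_mini_batch consumes:

def backprop(self, x, y):
    # method of the Network class; returns (nabla_b, nabla_w), the gradient of the cost for one example
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # forward pass, keeping every layer's weighted input z and activation
    activation = x
    activations = [x]
    zs = []
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass: output layer error, then propagate it back layer by layer
    delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    for l in range(2, self.num_layers):
        z = zs[-l]
        delta = np.dot(self.weights[-l + 1].transpose(), delta) * sigmoid_prime(z)
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
    return (nabla_b, nabla_w)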

Cost Derivative Function

So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
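
Assuming a quadratic cost (as in this kind of simple network), the cost derivative used by backprop is just the difference between the output activations and the target:

def cost_derivative(self, output_activations, y):
    # derivative of the quadratic cost 1/2 * ||output - y||^2 with respect to the output
    return (output_activations - y)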

 Putting it all together – Network.py

Links

Flask Digits Classifier

https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/index.html#0

Audit Security in AWS

When Should You Perform a Security Audit?

You should audit your security configuration in the following situations:

  • On a periodic basis. You should perform the steps described in this document at regular intervals as a best practice for security.
  • If there are changes in your organization, such as people leaving.
  • If you have stopped using one or more individual AWS services. This is important for removing permissions that users in your account no longer need.
  • If you’ve added or removed software in your accounts, such as applications on Amazon EC2 instances, AWS OpsWorks stacks, AWS CloudFormation templates, etc.
  • If you ever suspect that an unauthorized person might have accessed your account.

General Guidelines for Auditing

As you review your account’s security configuration, follow these guidelines:

  • Be thorough. Look at all aspects of your security configuration, including those you might not use regularly.
  • Don’t assume. If you are unfamiliar with some aspect of your security configuration (for example, the reasoning behind a particular policy or the existence of a role), investigate the business need until you are satisfied.
  • Keep things simple. To make auditing (and management) easier, use IAM groups, consistent naming schemes, and straightforward policies.

Review Your AWS Account Credentials

Take these steps when you audit your AWS account credentials:

  1. If you’re not using the root access keys for your account, remove them. We strongly recommend that you do not use root access keys for everyday work with AWS, and that instead you create IAM users.
  2. If you do need to keep the access keys for your account, rotate them regularly.

Review Your IAM Users

Take these steps when you audit your existing IAM users:

  1. Delete users that are not active.
  2. Remove users from groups that they don’t need to be a part of.
  3. Review the policies attached to the groups the user is in. See Tips for Reviewing IAM Policies.
  4. Delete security credentials that the user doesn’t need or that might have been exposed. For example, an IAM user that is used for an application does not need a password (which is necessary only to sign in to AWS websites). Similarly, if a user does not use access keys, there’s no reason for the user to have one. For more information, see Managing Passwords for IAM Users and Managing Access Keys for IAM Users in the IAM User Guide.

    You can generate and download a credential report that lists all IAM users in your account and the status of their various credentials, including passwords, access keys, and MFA devices. For passwords and access keys, the credential report shows how recently the password or access key has been used. Credentials that have not been used recently might be good candidates for removal. For more information, see Getting Credential Reports for your AWS Account in the IAM User Guide. (A boto3 sketch for pulling this report follows this list.)

  5. Rotate (change) user security credentials periodically, or immediately if you ever share them with an unauthorized person. For more information, see Managing Passwords for IAM Users and Managing Access Keys for IAM Users in the IAM User Guide.
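
A hedged boto3 sketch of pulling that credential report so stale passwords and access keys stand out:

# Generate the IAM credential report and print last-used dates per user.
import csv
import io
import time
import boto3

iam = boto3.client('iam')

# Ask AWS to (re)generate the report, then poll until it is ready.
while iam.generate_credential_report()['State'] != 'COMPLETE':
    time.sleep(2)

report = iam.get_credential_report()['Content'].decode('utf-8')
for row in csv.DictReader(io.StringIO(report)):
    print(row['user'], row['password_last_used'], row['access_key_1_last_used_date'])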

Review Your IAM Groups

Take these steps when you audit your IAM groups:

  1. Delete unused groups.
  2. Review users in each group and remove users who don’t belong. See Review Your IAM Users earlier.
  3. Review the policies attached to the group. See Tips for Reviewing IAM Policies.

Review Your IAM Roles

Take these steps when you audit your IAM roles:

  1. Delete roles that are not in use.
  2. Review the role’s trust policy. Make sure that you know who the principal is and that you understand why that account or user needs to be able to assume the role.
  3. Review the access policy for the role to be sure that it grants suitable permissions to whoever assumes the role—see Tips for Reviewing IAM Policies.

Review Your IAM Providers for SAML and OpenID Connect (OIDC)

If you have created an IAM entity for establishing trust with a SAML or OIDC identity provider, take these steps:

  1. Delete unused providers.
  2. Download and review the AWS metadata documents for each SAML provider and make sure the documents reflect your current business needs. Alternatively, get the latest metadata documents from the SAML IdPs that you want to establish trust with and update the provider in IAM.

Review Your Mobile Apps

If you have created a mobile app that makes requests to AWS, take these steps:

  1. Make sure that the mobile app does not contain embedded access keys, even if they are in encrypted storage.
  2. Get temporary credentials for the app by using APIs that are designed for that purpose. We recommend that you use Amazon Cognito to manage user identity in your app. This service lets you authenticate users using Login with Amazon, Facebook, Google, or any OpenID Connect (OIDC)–compatible identity provider. You can then use the Amazon Cognito credentials provider to manage credentials that your app uses to make requests to AWS.

    If your mobile app doesn’t support authentication using Login with Amazon, Facebook, Google, or any other OIDC-compatible identity provider, you can create a proxy server that can dispense temporary credentials to your app.

Review Your Amazon EC2 Security Configuration

Take the following steps for each AWS region:

  1. Delete Amazon EC2 key pairs that are unused or that might be known to people outside your organization.
  2. Review your Amazon EC2 security groups:
    • Remove security groups that no longer meet your needs.
    • Remove rules from security groups that no longer meet your needs. Make sure you know why the ports, protocols, and IP address ranges they permit have been allowed.
  3. Terminate instances that aren’t serving a business need or that might have been started by someone outside your organization for unapproved purposes. Remember that if an instance is started with a role, applications that run on that instance can access AWS resources using the permissions that are granted by that role.
  4. Cancel spot instance requests that aren’t serving a business need or that might have been made by someone outside your organization.
  5. Review your Auto Scaling groups and configurations. Shut down any that no longer meet your needs or that might have been configured by someone outside your organization.

Review AWS Policies in Other Services

Review the permissions for services that use resource-based policies or that support other security mechanisms. In each case, make sure that only users and roles with a current business need have access to the service’s resources, and that the permissions granted on the resources are the fewest necessary to meet your business needs.

Monitor Activity in Your AWS Account

Follow these guidelines for monitoring AWS activity:

  • Turn on AWS CloudTrail in each account and use it in each supported region.
  • Periodically examine CloudTrail log files. (CloudTrail has a number of partners who provide tools for reading and analyzing log files.)
  • Enable Amazon S3 bucket logging to monitor requests made to each bucket.
  • If you believe there has been unauthorized use of your account, pay particular attention to temporary credentials that have been issued. If temporary credentials have been issued that you don’t recognize, disable their permissions.
  • Enable billing alerts in each account and set a cost threshold that lets you know if your charges exceed your normal usage.

Tips for Reviewing IAM Policies

Policies are powerful and subtle, so it’s important to study and understand the permissions that are granted by each policy. Use the following guidelines when reviewing policies:

  • As a best practice, attach policies to groups instead of to individual users. If an individual user has a policy, make sure you understand why that user needs the policy.
  • Make sure that IAM users, groups, and roles have only the permissions that they need.
  • Use the IAM Policy Simulator to test policies that are attached to users or groups.
  • Remember that a user’s permissions are the result of all applicable policies—user policies, group policies, and resource-based policies (on Amazon S3 buckets, Amazon SQS queues, Amazon SNS topics, and AWS KMS keys). It’s important to examine all the policies that apply to a user and to understand the complete set of permissions granted to an individual user.
  • Be aware that allowing a user to create an IAM user, group, role, or policy and attach a policy to the principal entity is effectively granting that user all permissions to all resources in your account. That is, users who are allowed to create policies and attach them to a user, group, or role can grant themselves any permissions. In general, do not grant IAM permissions to users or roles whom you do not trust with full access to the resources in your account. The following list contains IAM permissions that you should review closely:
    • iam:PutGroupPolicy
    • iam:PutRolePolicy
    • iam:PutUserPolicy
    • iam:CreatePolicy
    • iam:CreatePolicyVersion
    • iam:AttachGroupPolicy
    • iam:AttachRolePolicy
    • iam:AttachUserPolicy
  • Make sure policies don’t grant permissions for services that you don’t use. For example, if you use AWS managed policies, make sure the AWS managed policies that are in use in your account are for services that you actually use. To find out which AWS managed policies are in use in your account, use the IAM GetAccountAuthorizationDetails API (AWS CLI command: aws iam get-account-authorization-details); a boto3 sketch follows this list.
  • If the policy grants a user permission to launch an Amazon EC2 instance, it might also allow the iam:PassRole action, but if so it should explicitly list the roles that the user is allowed to pass to the Amazon EC2 instance.
  • Closely examine any values for the Action or Resource element that include *. It’s a best practice to grant Allow access to only the individual actions and resources that users need. However, the following are reasons that it might be suitable to use * in a policy:
    • The policy is designed to grant administrative-level privileges.
    • The wildcard character is used for a set of similar actions (for example, Describe*) as a convenience, and you are comfortable with the complete list of actions that are referenced in this way.
    • The wildcard character is used to indicate a class of resources or a resource path (e.g., arn:aws:iam::account-id:users/division_abc/*), and you are comfortable granting access to all of the resources in that class or path.
    • A service action does not support resource-level permissions, and the only choice for a resource is *.
  • Examine policy names to make sure they reflect the policy’s function. For example, although a policy might have a name that includes “read only,” the policy might actually grant write or change permissions.
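
The list above points to the GetAccountAuthorizationDetails API for finding which AWS managed policies are in use; as a lighter-weight alternative, here is a hedged boto3 sketch using list_policies with OnlyAttached=True:

# List the AWS managed policies that are actually attached in the account.
import boto3

iam = boto3.client('iam')
paginator = iam.get_paginator('list_policies')

for page in paginator.paginate(Scope='AWS', OnlyAttached=True):
    for policy in page['Policies']:
        print(policy['PolicyName'], policy['AttachmentCount'])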

The rush to build KNOWLEDGE DISCOVERY ENGINES

People who build AI are focused on three areas: machine intelligence, natural language processing, and machine perception. That is, building systems that can think, listen, and see.

 


Data Mining Process


STAGE ONE – DETERMINE BUSINESS OBJECTIVES

The first stage of the CRISP-DM process is to understand what you want to accomplish from a business perspective. Your organization may have competing objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of the project. Neglecting this step can mean that a great deal of effort is put into producing the right answers to the wrong questions.

What are the desired outputs of the project?

  1. Set objectives – This means describing your primary objective from a business perspective. There may also be other related questions that you would like to address. For example, your primary goal might be to keep current customers by predicting when they are prone to move to a competitor. Related business questions might be “Does the channel used affect whether customers stay or go?” or “Will lower ATM fees significantly reduce the number of high-value customers who leave?”
  2. Produce project plan – Here you’ll describe the plan for achieving the data mining and business goals. The plan should specify the steps to be performed during the rest of the project, including the initial selection of tools and techniques.
  3. Business success criteria – Here you’ll lay out the criteria that you’ll use to determine whether the project has been successful from the business point of view. These should ideally be specific and measurable, for example a reduction of customer churn to a certain level; however, sometimes it might be necessary to have more subjective criteria such as “give useful insights into the relationships.” If this is the case, it needs to be clear who makes the subjective judgment.

Assess the current situation

This involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that you’ll need to consider when determining your data analysis goal and project plan.

  1. Inventory of resources – List the resources available to the project including:
    • Personnel (business experts, data experts, technical support, data mining experts)
    • Data (fixed extracts, access to live, warehoused, or operational data)
    • Computing resources (hardware platforms)
    • Software (data mining tools, other relevant software)
  2. Requirements, assumptions and constraints – List all requirements of the project including the schedule of completion, the required comprehensibility and quality of results, and any data security concerns as well as any legal issues. Make sure that you are allowed to use the data. List the assumptions made by the project. These may be assumptions about the data that can be verified during data mining, but may also include non-verifiable assumptions about the business related to the project. It is particularly important to list the latter if they will affect the validity of the results. List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data set that it is practical to use for modelling.
  3. Risks and contingencies – List the risks or events that might delay the project or cause it to fail. List the corresponding contingency plans – what action will you take if these risks or events take place?
  4. Terminology – Compile a glossary of terminology relevant to the project. This will generally have two components:
    • A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful “knowledge elicitation” and education exercise.
    • A glossary of data mining terminology, illustrated with examples relevant to the business problem in question.
  5. Costs and benefits – Construct a cost-benefit analysis for the project which compares the costs of the project with the potential benefits to the business if it is successful. This comparison should be as specific as possible. For example, you should use financial measures in a commercial situation.

Determine data mining goals

A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms. For example, the business goal might be “Increase catalogue sales to existing customers.” A data mining goal might be “Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.), and the price of the item.”

  1. Data mining goals – Describe the intended outputs of the project that enable the achievement of the business objectives.
  2. Data mining success criteria – define the criteria for a successful outcome to the project in technical terms—for example, a certain level of predictive accuracy or a propensity-to-purchase profile with a given degree of “lift.” As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified.

Produce project plan

Describe the intended plan for achieving the data mining goals and thereby achieving the business goals. Your plan should specify the steps to be performed during the rest of the project, including the initial selection of tools and techniques.

  1. Project plan – List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the data mining process, for example, repetitions of the modelling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.
  2. Initial assessment of tools and techniques – At the end of the first phase you should undertake an initial assessment of tools and techniques. Here, for example, you select a data mining tool that supports various methods for different stages of the process. It is important to assess tools and techniques early in the process since the selection of tools and techniques may influence the entire project.

STAGE TWO – DATA UNDERSTANDING

The second stage of the CRISP-DM process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you’re going to integrate these.

  • Initial data collection report – List the data sources acquired together with their locations, the methods used to acquire them and any problems encountered. Record problems you encountered and any resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.

Describe data

Examine the “gross” or “surface” properties of the acquired data and report on the results.

  • Data description report – Describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements.

Explore data

During this stage you’ll address data mining questions using querying, data visualization and reporting techniques. These may include:

  • Distribution of key attributes (for example, the target attribute of a prediction task)
  • Relationships between pairs or small numbers of attributes
  • Results of simple aggregations
  • Properties of significant sub-populations
  • Simple statistical analyses

These analyses may directly address your data mining goals. They may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation steps needed for further analysis.

  • Data exploration report – Describe results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

Verify data quality

Examine the quality of the data, addressing questions such as:

  • Is the data complete (does it cover all the cases required)?
  • Is it correct, or does it contain errors and, if there are errors, how common are they?
  • Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

Data quality report

List the results of the data quality verification. If quality problems exist, suggest possible solutions. Solutions to data quality problems generally depend heavily on both data and business knowledge.

STAGE THREE – DATA PREPARATION

Select your data

This is the stage of the project where you decide on the data that you’re going to use for analysis. The criteria you might use to make this decision include the relevance of the data to your data mining goals, the quality of the data, and also technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.

  • Rationale for inclusion/exclusion – List the data to be included/excluded and the reasons for these decisions.

Clean your data

This task involves raising the data quality to the level required by the analysis techniques that you’ve selected. This may involve selecting clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modelling.

  • Data cleaning report – Describe what decisions and actions you took to address data quality problems. Consider any transformations of the data made for cleaning purposes and their possible impact on the analysis results.

Construct required data

This task includes constructive data preparation operations such as the production of derived attributes or entire new records, or transformed values for existing attributes.

  • Derived attributes – These are new attributes that are constructed from one or more existing attributes in the same record, for example you might use the variables of length and width to calculate a new variable of area.
  • Generated records – Here you describe the creation of any completely new records. For example you might need to create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modelling purposes it might make sense to explicitly represent the fact that particular customers made zero purchases.

Integrate data

These are methods whereby information is combined from multiple databases, tables or records to create new records or values.

  • Merged data – Merging tables refers to joining together two or more tables that have different information about the same objects. For example a retail chain might have one table with information about each store’s general characteristics (e.g., floor space, type of mall), another table with summarized sales data (e.g., profit, percent change in sales from previous year), and another with information about the demographics of the surrounding area. Each of these tables contains one record for each store. These tables can be merged together into a new table with one record for each store, combining fields from the source tables.
  • Aggregations – Aggregation refers to operations in which new values are computed by summarising information from multiple records and/or tables. For example, converting a table of customer purchases, with one record per purchase, into a new table with one record per customer and fields such as number of purchases, average purchase amount, percent of orders charged to credit card, and percent of items under promotion (see the pandas sketch below).
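
A hedged pandas sketch of that aggregation example; the file and column names are made up for illustration:

# Roll a one-row-per-purchase table up into one row per customer.
import pandas as pd

purchases = pd.read_csv('purchases.csv')   # hypothetical: one row per purchase
per_customer = purchases.groupby('customer_id').agg(
    number_of_purchases=('purchase_id', 'count'),
    average_purchase_amount=('amount', 'mean'),
    share_paid_by_credit_card=('paid_by_credit_card', 'mean'),  # fraction of purchases on card
)
print(per_customer.head())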

STAGE FOUR – MODELLING

Select modeling technique

As the first step in modelling, you’ll select the actual modelling technique that you’ll be using. Although you may have already selected a tool during the business understanding phase, at this stage you’ll be selecting the specific modelling technique e.g. decision-tree building with C5.0, or neural network generation with back propagation. If multiple techniques are applied, perform this task separately for each technique.

  • Modelling technique – Document the actual modelling technique that is to be used.
  • Modelling assumptions – Many modelling techniques make specific assumptions about the data, for example that all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic etc. Record any assumptions made.

Generate test design

Before you actually build a model you need to generate a procedure or mechanism to test the model’s quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, you typically separate the dataset into train and test sets, build the model on the train set, and estimate its quality on the separate test set.

  • Test design – Describe the intended plan for training, testing, and evaluating the models. A primary component of the plan is determining how to divide the available dataset into training, test and validation datasets.
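
A minimal sketch of the train/test split described above, using scikit-learn; the data here is a random stand-in:

# Hold out 20% of the examples for estimating model quality.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)                 # stand-in feature matrix
y = np.random.randint(0, 2, size=1000)       # stand-in class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)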

Build model

Run the modelling tool on the prepared dataset to create one or more models.

  • Parameter settings – With any modelling tool there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice of parameter settings.
  • Models – These are the actual models produced by the modelling tool, not a report on the models.
  • Model descriptions – Describe the resulting models, report on the interpretation of the models and document any difficulties encountered with their meanings.

Assess model

Interpret the models according to your domain knowledge, your data mining success criteria and your desired test design. Judge the success of the application of modelling and discovery techniques technically, then contact business analysts and domain experts later in order to discuss the data mining results in the business context. This task only considers models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project.

At this stage you should rank the models and assess them according to the evaluation criteria. You should take the business objectives and business success criteria into account as far as you can here. In most data mining projects a single technique is applied more than once and data mining results are generated with several different techniques.

  • Model assessment – Summarise the results of this task, list the qualities of your generated models (e.g. in terms of accuracy) and rank their quality in relation to each other.
  • Revised parameter settings – According to the model assessment, revise parameter settings and tune them for the next modelling run. Iterate model building and assessment until you strongly believe that you have found the best model(s). Document all such revisions and assessments.

STAGE FIVE – EVALUATION

 

Evaluate your results

Previous evaluation steps dealt with factors such as the accuracy and generality of the model. During this step you’ll assess the degree to which the model meets your business objectives and seek to determine if there is some business reason why this model is deficient. Another option is to test the model(s) on test applications in the real application, if time and budget constraints permit. The evaluation phase also involves assessing any other data mining results you’ve generated. Data mining results involve models that are necessarily related to the original business objectives and all other findings that are not necessarily related to the original business objectives, but might also unveil additional challenges, information, or hints for future directions.

  • Assessment of data mining results – Summarise assessment results in terms of business success criteria, including a final statement regarding whether the project already meets the initial business objectives.
  • Approved models – After assessing models with respect to business success criteria, the generated models that meet the selected criteria become the approved models.

Review process

At this point, the resulting models appear to be satisfactory and to satisfy business needs. It is now appropriate for you to do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues—for example: did we correctly build the model? Did we use only the attributes that we are allowed to use and that are available for future analyses?

  • Review of process – Summarise the process review and highlight activities that have been missed and those that should be repeated.

Determine next steps

Depending on the results of the assessment and the process review, you now decide how to proceed. Do you finish this project and move on to deployment, initiate further iterations, or set up new data mining projects? You should also take stock of your remaining resources and budget as this may influence your decisions.

  • List of possible actions – List the potential further actions, along with the reasons for and against each option.
  • Decision – Describe the decision as to how to proceed, along with the rationale.

STAGE SIX – DEPLOYMENT

Plan deployment

In the deployment stage you’ll take your evaluation results and determine a strategy for their deployment. If a general procedure has been identified to create the relevant model(s), this procedure is documented here for later deployment. It makes sense to consider the ways and means of deployment during the business understanding phase as well, because deployment is absolutely crucial to the success of the project. This is where predictive analytics really helps to improve the operational side of your business.

  • Deployment plan – Summarise your deployment strategy including the necessary steps and how to perform them.

Plan monitoring and maintenance

Monitoring and maintenance are important issues if the data mining result becomes part of the day-to-day business and its environment. The careful preparation of a maintenance strategy helps to avoid unnecessarily long periods of incorrect usage of data mining results. In order to monitor the deployment of the data mining result(s), the project needs a detailed monitoring process plan. This plan takes into account the specific type of deployment.

  • Monitoring and maintenance plan – Summarise the monitoring and maintenance strategy, including the necessary steps and how to perform them.

Produce final report

At the end of the project you will write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences (if they have not already been documented as an ongoing activity) or it may be a final and comprehensive presentation of the data mining result(s).

  • Final report – This is the final written report of the data mining engagement. It includes all of the previous deliverables, summarising and organising the results.
  • Final presentation – There will also often be a meeting at the conclusion of the project at which the results are presented to the customer.

Review project

Assess what went right and what went wrong, what was done well and what needs to be improved.

  • Experience documentation – Summarize important experience gained during the project. For example, any pitfalls you encountered, misleading approaches, or hints for selecting the best suited data mining techniques in similar situations could be part of this documentation. In ideal projects, experience documentation also covers any reports that have been written by individual project members during previous phases of the project.

https://www.the-modeling-agency.com/crisp-dm.pdf

Hyper Parameter Optimization and AutoML
