DS Practical: Practical-4

Visual programming with the Orange tool.

What is visual programming?

In computing, a visual programming language is any programming language that lets users create programs by manipulating program elements graphically rather than by specifying them textually.

Data Sampler

Inputs

Data: input dataset

Outputs

Data Sample: sampled data instances
Remaining Data: out-of-sample data

Many data sampling methods are implemented by the Data Sampler widget. It outputs a sampled and a complementary dataset (with instances from the input set that are not included in the sampled dataset). The output is processed after the input dataset is provided and Sample Data is pressed.

Fixed proportion of data returns a chosen percentage of the entire dataset (e.g. 70% of all the data).

Fixed sample size returns a selected number of data instances, with an option to enable Sample with replacement, which always samples from the entire dataset (so the same instance can be drawn more than once).

Cross-Validation partitions data instances into the specified number of complementary subsets. Following a typical validation schema, all subsets except the one selected by the user are output as Data Sample, and the selected subset goes to Remaining Data. (Note: In older versions, the outputs were swapped. If the widget is loaded from an older workflow, it switches to compatibility mode.)

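Bootstrap draws a sample of the same size as the input dataset by sampling with replacement, so some instances appear more than once and others not at all. The instances that were never drawn go to Remaining Data.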
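The sampling methods above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not Orange's own code: the function names fixed_proportion, fixed_size and bootstrap are made up for this example, and the "dataset" is just a list of 100 numbers.

```python
import random


def fixed_proportion(data, proportion, seed=0):
    """Return (sample, remaining) with about `proportion` of the rows in the sample."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cut = round(proportion * len(data))
    chosen = set(indices[:cut])
    sample = [data[i] for i in sorted(chosen)]
    remaining = [row for i, row in enumerate(data) if i not in chosen]
    return sample, remaining


def fixed_size(data, size, replacement=False, seed=0):
    """Return (sample, remaining) with exactly `size` rows in the sample."""
    rng = random.Random(seed)
    n = len(data)
    if replacement:
        # Every draw is made from the entire dataset, so repeats are possible.
        picked = [rng.randrange(n) for _ in range(size)]
    else:
        picked = rng.sample(range(n), size)
    chosen = set(picked)
    sample = [data[i] for i in picked]
    remaining = [row for i, row in enumerate(data) if i not in chosen]
    return sample, remaining


def bootstrap(data, seed=0):
    """Sample len(data) rows with replacement; remaining = rows never drawn."""
    return fixed_size(data, len(data), replacement=True, seed=seed)


data = list(range(100))  # stand-in for a table with 100 instances

sample, rest = fixed_proportion(data, 0.7)
print(len(sample), len(rest))   # 70 in the sample, 30 remaining

five, others = fixed_size(data, 5)
print(len(five), len(others))   # 5 in the sample, 95 remaining

boot, out_of_bag = bootstrap(data)
print(len(boot), len(out_of_bag))
```

In every case the second return value plays the role of the Remaining Data output: it holds exactly the instances that did not end up in the sample.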

Fixed Sample Size

First, let’s see how the Data Sampler works. We will use the zoo data from the File widget. We see there are 303 instances in the data. We sample the data with the Data Sampler widget, choosing a fixed sample size of 5 instances for simplicity. We can observe the sampled data in the Data Table widget (Data Table (in-sample)). The second Data Table (Data Table (out-of-sample)) shows the remaining 298 instances that weren’t in the sample. To output the out-of-sample data, double-click the connection between the widgets and rewire the output to Remaining Data –> Data.

A fixed proportion of data:

Now, we will use the Data Sampler to split the iris data, loaded with the File widget, into a training part and a testing part. In Data Sampler, we split the data with Fixed proportion of data, keeping 70% of the data instances in the sample. Then we connect the two outputs to the Test & Score widget: Data Sample –> Data and Remaining Data –> Test Data. Finally, we add Logistic Regression as the learner. Test & Score trains logistic regression on the Data input and evaluates the results on the Test Data.
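The same train-then-test idea can be written out by hand. The sketch below is a rough stand-in for the workflow, not what Orange runs internally: it uses synthetic two-class data with one feature instead of Iris so that it needs no extra libraries, and the helpers make_data, split, fit_logistic and accuracy are hypothetical names chosen for this example.

```python
import math
import random


def make_data(n=150, seed=0):
    """Synthetic data (not Iris): one numeric feature, class 0 or 1."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        x = rng.gauss(2.0 if label else -2.0, 1.0)
        rows.append((x, label))
    return rows


def split(rows, proportion=0.7, seed=0):
    """Shuffle, then keep `proportion` of rows for training, the rest for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(proportion * len(rows))
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    # Written in two branches to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(train, lr=0.1, epochs=200):
    """Fit weight w and bias b by plain batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in train:
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(model, rows):
    """Share of rows whose predicted class matches the true class."""
    w, b = model
    hits = sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in rows)
    return hits / len(rows)


rows = make_data()
train, test = split(rows)          # Data Sample, Remaining Data
model = fit_logistic(train)        # learn on the 70% sample only
print(len(train), len(test))
print("test accuracy:", accuracy(model, test))
```

The important point is that the model never sees the test rows while it is being fitted, which is what connecting Remaining Data to Test Data achieves in the workflow.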

Cross-Validation:

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
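As a small sketch of how k-fold splitting works, the code below cuts 150 instances into k=10 folds and, for each fold in turn, holds that fold out while keeping the rest. It follows the output convention described for the Data Sampler above (the selected fold becomes Remaining Data), but it is plain Python rather than Orange's implementation, and the function names are made up for this example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_split(data, k, selected, seed=0):
    """Return (sample, remaining): remaining is the selected fold, sample is the rest."""
    folds = k_fold_indices(len(data), k, seed)
    held_out = set(folds[selected])
    sample = [row for i, row in enumerate(data) if i not in held_out]
    remaining = [row for i, row in enumerate(data) if i in held_out]
    return sample, remaining


data = list(range(150))  # stand-in for a table with 150 instances
k = 10

for fold in range(k):
    sample, remaining = cross_validation_split(data, k, fold)
    # With 150 instances and k=10: 135 rows to train on, 15 held out.
    print(fold, len(sample), len(remaining))
```

Because the same seed is used for every call, the ten held-out folds do not overlap and together cover every instance exactly once.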
