Wednesday, September 25, 2019

Creating a training set table for machine learning in Oracle Database

When building a machine learning model, you will require a learning / training set of data. To enable you to quickly create a set of training data you can make use of the SQL SAMPLE clause in a select statement. Using the SAMPLE clause you instruct the database to select from a random sample of data from the table, rather than from the entire table. This provides a very simple way of getting the random collection of records you require for training your model.

Situation
You do have a large (or small) table of data in your database, in our case we use an Oracle Autonomous Data Warehouse and intend to use a part of this as training data while you want to use the remaining part for testing your model.


Assume we have a table named louwersj.loans which we want to use for both our training data as well as our test data. A simple way of splitting it in a 70/30 fashion is to use the below commands:

Step 1:
Check the total number of records in the table:

SELECT
    COUNT(1)
FROM
    louwersj.loans;

In our case this will give us the result of 614 as we have 614 records in our dataset

Step 2:
Take a 70% of the total and use this to create a table, this is where we will use the sample clause in the SQL statement to ensure we get a random 70% of the records. By issuing the below command the table loans_traindata will be exactly the same as the original table loans with only a random subset of the original table.

CREATE TABLE louwersj.loans_traindata
    AS
        SELECT
            *
        FROM
            louwersj.loans SAMPLE ( 70 ) SEED ( 1 )


To validate if this gives us what we wanted we can do another count to see if we indeed get a training set which contains 70% of the original table, the below command will return 455.

SELECT
    COUNT(1)
FROM
    louwersj.loans_traindata

Step 3:
Next to the train data we need to have some test data to validate the working of our machine learning model after we have trained it. For this we can use the remaining 30% of the data from the original table. With the following command we create a new table which will contain exactly that:

CREATE TABLE louwersj.loans_testdata
    AS
        SELECT
            *
        FROM
            louwersj.loans
        MINUS
        SELECT
            *
        FROM
            louwersj.loans_traindata

Conclusion
Using the SAMPLE clause as part of a CREATE TABLE AS statement in the Oracle Database helps you to speed up creating a good training set and test for your machine learning model. No needing to extract data from the database and re-insert the data, you can do all within the database without any actual moving of the data. 

No comments: