3 Replies Latest reply on Feb 14, 2018 2:27 PM by asoudek

    Train Data & Test Data in Data science

    emmablisa

      I am relatively new to data science in Python and was exploring some data science competitions. I am getting confused by the "Training Data Set" and the "Test Data Set". Some projects have merged the two and some have kept them separate. What is the rationale behind having two data sets? Any advice would be helpful, thanks.

        • Re: Train Data & Test Data in Data science
          rschmitz

          Hi Emma,

           

          Thanks for reaching out to us on PI Square.

           

          This question isn't strictly a PI question, but I'll do my best to help clarify it for you. In machine learning, you're generally building an algorithm that can make predictions. In order to do that, you need a "Training Data Set" for which you know the answers. Let's say I'm trying to classify flowers into either tulips or roses. I might feed my model data on the petal color, width, and length and then subsequently tell the model that this data (this set of color, width, and length) is a tulip. We do this a bunch of times and the model will then use that data to create an algorithm to guess whether a flower is a tulip or rose. This Training Data Set should be representative of what you want to predict.

           

          Now that the model has built out this algorithm, you want to test it out and see if it can make predictions well, but you don't want to show it the data it's already seen (the Training Data Set). It's likely to do pretty well on that data because the model was built from it. This is where your "Test Data Set" comes in.

           

          Your Test Data Set is going to be new data that the model/algorithm hasn't seen and that you know the answers to. You want to feed the algorithm some colors, lengths, and widths again, but this time you want the algorithm to tell you what the flower is. So the algorithm makes an educated guess, and you can check its output against the answers you know to be correct to get a percent-correct figure (the accuracy) and refine the model as needed.
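
          To make that concrete, here is a minimal sketch in plain Python. The measurements are made up, color is left out to keep it short, and the "model" is just an average point per flower type (a nearest-centroid rule that I picked for illustration, not something a particular library or competition prescribes). The names train, predict, and accuracy are my own.

```python
# Hypothetical measurements: (petal_width_cm, petal_length_cm) -> known answer.
# All numbers are made up for illustration.
train_set = [
    ((2.0, 5.1), "tulip"), ((2.2, 4.8), "tulip"), ((1.8, 5.3), "tulip"),
    ((4.1, 3.0), "rose"), ((3.9, 3.2), "rose"), ((4.3, 2.9), "rose"),
]
# Test data: new flowers the model has not seen, with known answers.
test_set = [
    ((2.1, 5.0), "tulip"), ((1.9, 4.9), "tulip"),
    ((4.0, 3.1), "rose"), ((4.2, 3.3), "rose"),
]


def train(examples):
    """Learn one average point (centroid) per flower type."""
    sums = {}
    for (width, length), label in examples:
        total = sums.setdefault(label, [0.0, 0.0, 0])
        total[0] += width
        total[1] += length
        total[2] += 1
    return {label: (w / n, l / n) for label, (w, l, n) in sums.items()}


def predict(model, features):
    """Guess the flower type whose centroid is closest to the features."""
    width, length = features
    return min(
        model,
        key=lambda label: (model[label][0] - width) ** 2
        + (model[label][1] - length) ** 2,
    )


def accuracy(model, examples):
    """Fraction of examples where the guess matches the known answer."""
    correct = sum(predict(model, x) == y for x, y in examples)
    return correct / len(examples)


model = train(train_set)                   # build the model from training data only
test_accuracy = accuracy(model, test_set)  # score it on data it has not seen
print(f"correct on unseen test data: {test_accuracy:.0%}")
```

          The point is only the workflow: train is called on the training set alone, and the score comes from the test set.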

           

          So in essence, the "Training Data Set" and the "Test Data Set" have the same requirements: a set of parameters you can feed to a model and the correct answer for the classification of that data. (Barring some more complex topics outside the scope of a forum) It doesn't really matter whether you train the model on your "Test Data" and then test on your "Training Data", as long as both data sets are relatively similar and you know the answers. That is why you'll see some examples where the training and test data sets are combined and some that have dedicated training sets and dedicated test sets. What you call them ("training data" vs "test data") matters far less than making sure you aren't testing with the same data that you trained with.
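
          If you are handed one combined data set, you can make the two sets yourself. The sketch below is one simple way to do it, assuming a made-up list of rows and a hypothetical helper I called split; the 20% test share and the fixed seed are arbitrary choices, not a rule.

```python
import random

# Hypothetical combined data set: every row has features and a known answer.
combined = [
    ((float(i), float(i % 3)), "tulip" if i % 2 == 0 else "rose")
    for i in range(10)
]


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split(combined)

# No row should end up in both sets.
overlap = set(train_rows) & set(test_rows)
print(len(train_rows), len(test_rows), len(overlap))
```

          Which part you call "train" and which "test" is up to you; what the check at the end guards is that the two parts share no rows.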

           

          Hope this helps clarify some things. Wikipedia also has a solid page with background information on this topic.

           

          Cheers,

          Rob

          • Re: Train Data & Test Data in Data science
            flost

            Hi Emma,

            In my opinion it is a matter of validating your model. If you use only the training data for testing, you can't be sure that the model does not just remember the data it already knows. Normally you check the results with unknown data, because otherwise the model may simply be repeating what it has already seen.
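
            A small sketch of that effect, assuming made-up flower measurements and a deliberately silly "model" that only remembers what it was shown and answers "tulip" for everything else (the names known, unknown, and memorizing_model are mine):

```python
# Hypothetical (petal_width_cm, petal_length_cm) -> flower type examples.
known = {
    (2.0, 5.1): "tulip", (2.2, 4.8): "tulip",
    (4.1, 3.0): "rose", (3.9, 3.2): "rose",
}
unknown = {
    (2.1, 5.0): "tulip", (1.9, 4.9): "tulip",
    (4.0, 3.1): "rose", (4.2, 3.3): "rose",
}


def memorizing_model(features):
    """Return the remembered answer, or a fixed guess for anything new."""
    return known.get(features, "tulip")


def fraction_correct(examples):
    """Share of examples where the model's answer matches the known one."""
    hits = sum(memorizing_model(x) == y for x, y in examples.items())
    return hits / len(examples)


score_on_known = fraction_correct(known)      # data the model has seen
score_on_unknown = fraction_correct(unknown)  # data the model has not seen
print(score_on_known, score_on_unknown)
```

            Tested only on the known data this model looks perfect, while on the unknown data it gets just the tulips right, which is why you need data the model has never seen.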

            Kind Regards

            Florian