Thanks for reaching out to us on PI Square.
This question isn't strictly a PI question, but I'll do my best to help clarify it for you. In machine learning, you're generally building an algorithm that can make predictions. In order to do that, you need a "Training Data Set" for which you know the answers. Let's say I'm trying to classify flowers into either tulips or roses. I might feed my model data on the petal color, width, and length and then subsequently tell the model that this data (this set of color, width, and length) is a tulip. We do this a bunch of times and the model will then use that data to create an algorithm to guess whether a flower is a tulip or rose. This Training Data Set should be representative of what you want to predict.
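To make that concrete, here's a minimal sketch of what a labeled training set might look like for the tulip/rose example. The feature values, the color encoding, and the choice of a decision tree are all invented for illustration (label 0 = tulip, 1 = rose); any classifier would do.

```python
# Minimal sketch of a labeled Training Data Set for the tulip/rose example.
# All numbers are made up; color is encoded as 0 or 1 for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [color_code, petal_width_cm, petal_length_cm]
X_train = [
    [0, 2.1, 5.0],   # tulip
    [0, 2.3, 5.4],   # tulip
    [1, 1.2, 3.1],   # rose
    [1, 1.0, 2.8],   # rose
]
y_train = [0, 0, 1, 1]  # the known answers (0 = tulip, 1 = rose)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # the model builds its rules from this data
```

The key point is that every row comes paired with a known answer, which is what lets the model learn the mapping from features to flower type.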
Now that the model has built out this algorithm, you want to test it and see whether it can make good predictions, but you don't want to show it the data it has already seen (the Training Data Set), as it's likely to do pretty well on that data; after all, the model was built using it. This is where your "Test Data Set" comes in.
Your Test Data Set is going to be new data that the model/algorithm hasn't seen and that you know the answers to. You feed the algorithm some colors, lengths, and widths again, but this time you want the algorithm to tell you what each flower is. The algorithm makes its educated guesses, and you check its output against the answers you know to be correct to get some kind of %correct indicator, refining the model as needed.
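That "%correct" check is just accuracy: the fraction of guesses that match the known answers. A quick sketch with invented labels:

```python
# Sketch of scoring predictions against known answers to get a
# "%correct" (accuracy) figure. Labels here are invented.
y_true = [0, 1, 1, 0, 1]   # known answers for the test set
y_pred = [0, 1, 0, 0, 1]   # what the algorithm guessed

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"{accuracy:.0%} correct")  # 4 of 5 right -> prints "80% correct"
```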
So in essence, the "Training Data Set" and the "Test Data Set" have the same requirements: a set of parameters you can feed to a model and the correct classification for each record. (Barring some more complex topics outside the scope of a forum post,) it doesn't really matter whether you train the model on your "Test Data" and then test on your "Training Data", as long as both data sets are relatively similar and you know the answers. That's why you'll see some examples where one data set is split into training and test portions and others that have dedicated training sets and dedicated test sets. What you call them ("training data" vs. "test data") matters far less than making sure you aren't testing with the same data you trained with.
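The common convention of carving one labeled data set into training and test portions can be sketched with scikit-learn's `train_test_split`; the data below is invented for illustration.

```python
# Sketch: hold out part of one labeled data set as the test set,
# rather than collecting a separate one. Data is made up.
from sklearn.model_selection import train_test_split

X = [[i, i * 2] for i in range(10)]   # ten made-up samples
y = [i % 2 for i in range(10)]        # made-up labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0  # 30% held out for testing
)
print(len(X_train), len(X_test))  # prints "7 3"
```

The `random_state` argument just makes the shuffle repeatable, which is handy when comparing model tweaks.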
Hope this helps clarify some things. Wikipedia also has a solid page covering this kind of background information.
In my opinion it is a matter of validating your model. If you use only training data for testing, you can't be sure that the model hasn't simply memorized the data it already knows. Normally you check the results with unseen data; otherwise the model may just be remembering what happened.
You should never use the training data set to test the model.
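To illustrate why: a model with enough capacity can score perfectly on data it has seen while having learned nothing general. The sketch below (invented random data, labels that are pure noise, an unrestricted decision tree) shows the gap between training-set and test-set scores.

```python
# Sketch: scoring on the training set is misleading. An unrestricted
# decision tree memorizes even pure-noise labels, scoring ~100% on the
# data it trained on while doing no better than chance on held-out data.
import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [random.randint(0, 1) for _ in range(200)]  # labels are pure noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

print(model.score(X_train, y_train))  # 1.0: it memorized the noise
print(model.score(X_test, y_test))    # near 0.5: no real predictive power
```

Only the test-set score tells you anything about how the model will do on new data.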
Global Solutions Group