Target and Partitioning

Target

The target variable has two attributes, “default” and “non-default”, usually denoted by “1” and “0” respectively. CreditScoring allows the user to select the event attribute. The event is the condition we are trying to model; in credit scoring, this is an occurrence of payment delinquency defined as a default.

By selecting the event and non-event attributes, we get a clearer picture of the percentage of events (defaults) in the modelling data set.
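
To make this concrete, here is a minimal sketch of computing the event rate, assuming the modelling data sits in a pandas DataFrame with a binary default flag column (the DataFrame and column names are illustrative, not part of CreditScoring):

    import pandas as pd

    # Illustrative modelling data; default_flag = 1 marks the event (default).
    loans = pd.DataFrame({"default_flag": [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]})

    n_events = int((loans["default_flag"] == 1).sum())  # number of defaults
    event_rate = n_events / len(loans)                  # share of events in the set

    print(f"{n_events} events, event rate = {event_rate:.1%}")
    # -> 3 events, event rate = 30.0%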

There are some general industry best practices, as well as heuristic rules, that can help determine the necessary number of events (defaults) and their relative percentage in the data set. As a rule of thumb, a data set should contain at least 1000 events (defaults) if we want to partition it into development, validation, and test samples, and roughly 200 events if we plan to use only development and validation samples. What are these samples?

Automated Partitioning

Imported data is usually split into three data sets:

  • Development
  • Validation
  • Test

Development Sample - This term denotes the set of data used to determine the potentially predictive relationship between the input variables and the target variable. In the kind of statistical modelling we are dealing with, the development sample is used to “fit” a model that can then be used to predict the value of the target variable from one or more input variables.

The validation sample is used to “test” the model in order to limit problems such as overfitting, and to give insight into how the model will perform on a subset of the modelling data set that acts like “new” or previously unseen data (for example, new loan applications).

The test sample is meant for one purpose only – to provide the best independent measure of the quality of the model.

And this is the reason we partition data into three (possibly two) samples: even though a somewhat more accurate model could be fitted if we used the whole data set as a development sample, we would then have no independent measure of how the model performs on previously unseen data.

Assuming that the input variables are moderately predictive – that is, they have a moderate effect on the value of the target variable – and that the number of defaults (events) is over 2500, CreditScoring by TrimTab will automatically suggest a 40:30:30 split among the Development/Validation/Test samples.
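
As an illustration of how such a three-way split is produced and how the samples are then used, here is a minimal sketch with scikit-learn; the synthetic data, variable names, and logistic regression model are assumptions for the example and do not reflect CreditScoring’s internals:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 4))                           # stand-in input variables
    y = (X[:, 0] + rng.normal(size=5000) > 1.5).astype(int)  # stand-in target (1 = default)

    # 40:30:30 - first carve off the 40% development sample; stratifying on y
    # keeps the event rate the same in every sample ...
    X_dev, X_rest, y_dev, y_rest = train_test_split(
        X, y, train_size=0.4, stratify=y, random_state=1)
    # ... then split the remaining 60% evenly into validation and test (30% each).
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=1)

    model = LogisticRegression().fit(X_dev, y_dev)           # fit on development only
    print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
    print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

The validation score guides modelling decisions along the way; the test score is consulted once, as the final independent measure of quality.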

The modelling data set is ideally split into three partitions. However, depending on the number of defaults, CreditScoring will suggest other splits – for example, a 70:30 Development/Validation split for modelling sets with between 1400 and 2500 defaults.

If the input variables are highly predictive, one can get away with a very low number (or a low percentage) of defaults and still successfully connect these events with the available input variables. CreditScoring by TrimTab suggests an 80:20 Development/Validation split for data sets with between 900 and 1400 defaults.

For extreme cases with fewer than 900 defaults, the software suggests using the Development sample exclusively (a 100:0 split).
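
The suggestion logic described above can be summarised in a few lines. The sketch below is an illustration of the documented thresholds only; the function name and the treatment of exact boundary values are assumptions, not CreditScoring’s implementation:

    def suggest_split(n_defaults):
        """Suggested Development/Validation/Test percentages for a default count."""
        if n_defaults > 2500:
            return (40, 30, 30)  # enough events for a full three-way split
        if n_defaults >= 1400:
            return (70, 30, 0)   # Development/Validation only
        if n_defaults >= 900:
            return (80, 20, 0)   # assumes highly predictive input variables
        return (100, 0, 0)       # fewer than 900 defaults: Development only

    print(suggest_split(3000))   # (40, 30, 30)
    print(suggest_split(1000))   # (80, 20, 0)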

Partitioning Import

CreditScoring allows the user to import a partitioning from an external source, which gives the user the flexibility to utilise existing analytical tools and propagate that analysis into the credit scoring process.
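
A minimal sketch of what such an import might look like, assuming the external tool exports a row identifier and a sample label for each record (the file, column names, and labels here are illustrative):

    import pandas as pd

    loans = pd.DataFrame({"loan_id": [101, 102, 103, 104],
                          "default_flag": [0, 1, 0, 0]})

    # In practice this frame would come from the external tool, e.g. via
    # pd.read_csv("partition.csv"); it is built inline so the sketch runs as-is.
    partition = pd.DataFrame({"loan_id": [101, 102, 103, 104],
                              "sample": ["DEV", "DEV", "VAL", "TEST"]})

    # Attach the externally defined sample label to each loan record.
    loans = loans.merge(partition, on="loan_id", how="left")
    dev_sample = loans[loans["sample"] == "DEV"]  # rows assigned to Development
    print(dev_sample)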