Datasets Used
I used three datasets: one measuring entrepreneurship, one for diabetes analysis, and one on employee retention. Each had a range of interesting attributes that seemed worth exploring and running the algorithm on.
Outcome
After running the FOLD-R++ algorithm, I was able to generate a ruleset for each of the three datasets. The diabetes dataset had the most conclusive and intuitively correct output. It listed higher glucose levels, higher BMI, and pregnancy as the main factors in determining whether a patient has diabetes.
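To make "ruleset" more concrete, here is a minimal sketch of how a single rule of this kind could be read as a Python predicate. The feature names, the thresholds, and the exception are placeholders I made up to show the shape of a default rule with an exception; they are not the values the run produced.

```python
# Illustrative only: names and thresholds are hypothetical placeholders,
# NOT the rules that FOLD-R++ generated for the diabetes dataset.

def is_exception(patient):
    """Hypothetical exception: a low BMI overrides the default conclusion."""
    return patient["bmi"] <= 25.0  # placeholder threshold


def predicts_diabetes(patient):
    """Default rule: high glucose implies diabetes, unless the exception holds."""
    high_glucose = patient["glucose"] > 140.0  # placeholder threshold
    return high_glucose and not is_exception(patient)


# Two made-up patients to exercise the rule.
patient_a = {"glucose": 150.0, "bmi": 31.0}
patient_b = {"glucose": 150.0, "bmi": 22.0}
result_a = predicts_diabetes(patient_a)
result_b = predicts_diabetes(patient_b)
```

Reading a rule as "conclusion holds by default, except when..." is what makes a short ruleset easy to sanity-check against intuition.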
The other two datasets produced a long list of assertions that the ruleset relied on. The ruleset itself was also much smaller than the one for the diabetes dataset. It seemed either to overfit the training data or to be unable to isolate a few key attributes with a stronger impact on the output. This is understandable, as the raw data may have had many inconsistencies, missing values, and similar issues that weren't handled beforehand.
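A quick check for missing values before training could look something like the sketch below. It uses only the Python standard library; the column names, the sample rows, and the set of tokens treated as "missing" are assumptions for illustration, not taken from the actual datasets.

```python
import csv
import io

# Tokens treated as missing; this list is an assumption and depends on the dataset.
MISSING = {"", "?", "na", "n/a", "nan", "null"}


def is_missing(value):
    """True if a CSV cell is absent or holds a missing-value token."""
    return value is None or value.strip().lower() in MISSING


def missing_report(rows):
    """Count missing cells per column, listing only columns with at least one."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if is_missing(value):
                counts[column] = counts.get(column, 0) + 1
    return counts


def drop_incomplete(rows):
    """Keep only rows in which every cell has a value."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Made-up sample standing in for a real file such as an employee retention CSV.
sample = io.StringIO("age,salary,left\n34,52000,no\n,48000,yes\n41,?,no\n")
rows = list(csv.DictReader(sample))
report = missing_report(rows)
clean_rows = drop_incomplete(rows)
```

Dropping incomplete rows is the bluntest option; if the report shows that most of the gaps sit in one or two columns, removing or imputing those columns may preserve more of the data.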
Takeaway
This algorithm is a neat tool for extracting general rulesets and conclusions from data, a task that would otherwise be tedious or infeasible to do manually, especially on larger datasets.
Built With
- fold-rpp
- python