Skip to content

Conversation

@mulhod
Copy link
Contributor

@mulhod mulhod commented Jul 31, 2023

Save RSMTool modeler in place of SKLL learner and make attributes like feature_info, train_predictions_mean, trim_min, etc., available in the modeler instance; modify fast_predict to be able to utilize these attributes in a backwards-compatible way so that fewer arguments can be passed to fast_predict.

This is a draft at the moment! No tests have been added quite yet. I would like to get some feedback on the direction here. Here are some points that I anticipate some discussion around:

  1. fast_predict can now be called with just the features and the modeler instance. Parameters like df_feature_info, trim_min, and h1_mean, if left unspecified, will be retrieved from the modeler (provided it was trained with this or a future version of RSMTool).
  2. Scaling and trimming in fast_predict can now be turned on/off via the scale/trim parameters, both of which are set to False by default. This is partially to allow for backwards-compatibility: If scaling-related parameters are not specified (currently), no scaling is done. But the attributes will always be there in the future (in the modeler object). Thus, this mechanism can be used to explicitly turn off trimming or scaling.
  3. The change is backwards-compatible! Old RSMTool models (serialized SKLL learners) can still be loaded. Use of fast_predict, furthermore, is backwards-compatible. All of the attributes of the modeler that would be used for things like trimming and scaling can be explicitly passed in in the same way they are now in the call to fast_predict. Thus, old models and old fast_predict calls should work without requiring any change at all.
  4. This change results in not saving SKLL models directly. Instead RSMTool modeler instances would be saved. These objects contain the SKLL model under the learner attribute, so we wouldn't really be losing much since it could always be referenced or saved by itself later.

Closes #631

@mulhod mulhod requested a review from desilinguist July 31, 2023 19:31
@mulhod mulhod self-assigned this Jul 31, 2023
@mulhod
Copy link
Contributor Author

mulhod commented Jul 31, 2023

ICYI, here's what I was doing to test that this works (at least for loading RSMTool modeler objects, so I haven't tested the backwards compatibility yet):

❯ ../../../env/bin/rsmtool config_rsmtool.json -f
Output directory: /Users/mmulholland/Documents/rsmtool/rsmtool/examples/rsmtool
WARNING: /Users/mmulholland/Documents/rsmtool/rsmtool/examples/rsmtool already contains a non-empty 'output' directory. Th
e generated report might contain unexpected information from a previous experiment.
Saving configuration file.
Reading in all data from files.
Preprocessing all features.
Pre-processing training and test set features
Saving training and test set data to disk.
Training LinearRegression model.
Running analyses on training set.
Generating training and test set predictions.
Processing train set predictions.
Processing test set predictions.
Scaling the coefficients and saving them to disk
Running prediction analyses.
Starting report generation.
Merging sections
Exporting HTML
WARNING: Alternative text is missing on 8 image(s).
❯ ../../../env/bin/ipython
Python 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:41) [Clang 15.0.7 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import rsmtool

In [2]: m = rsmtool.rsmtool.Modeler.load_from_file("output/ASAP2.model")

In [3]: rsmtool.rsmpredict.fast_predict({"DISCOURSE": 4.93806460126142, "ORGANIZATION": -0.0846667513334603, "GRAMMAR": -0.316793975540994, "MECHANICS": 4.65591397849462}, m)
Out[3]:
{'raw': 3.4671587960793446,
 'scale': 3.4876896893346814,
 'raw_trim': 3.4671587960793446,
 'raw_trim_round': 3,
 'scale_trim': 3.4876896893346814,
 'scale_trim_round': 3}

@desilinguist
Copy link
Collaborator

Hmm, One thing that immediately occurs to me is this – saving a modeler instead of a SKLL model – would be a completely backwards-incompatible change.

At the very least, we would need to modify how rsmpredict and rsmexplain work at the very least since they need access to the model. Also, we would need to change notebook sections that try to inspect the model object. Probably lots of stuff in the documentation too.

@desilinguist desilinguist marked this pull request as draft July 31, 2023 19:41
@mulhod
Copy link
Contributor Author

mulhod commented Jul 31, 2023

Oh, yeah, rsmpredict is being modified now. I started with fast_predict right now as a POC. But the idea would be to have logic that tries to load a Learner object and, if that fails, load a Modeler object (or vice versa).

@desilinguist
Copy link
Collaborator

Ah, okay, Yeah, we can probably just modify Modeler.load_from_file() to return a Modeler instance irrespective of whether the file on disk contains a serialized SKLL model or a serialized modeler. And then the rest should just work?

@mulhod
Copy link
Contributor Author

mulhod commented Jul 31, 2023

I'm going to push a change that should make that aspect a little bit clearer.

@mulhod mulhod force-pushed the feature/fast_predict_enhancement branch 2 times, most recently from c04d69c to 3bafca3 Compare August 1, 2023 04:25
mulhod added 5 commits August 2, 2023 00:44
…ke feature_info, train_prediction_mean, trim_min, etc., available in the modeler instance; modify fast_predict to be able to utilize these attributes in a backwards-compatible way so that fewer arguments can be passed to fast_predict
…ast_predict for better backwards-compatibility
@mulhod mulhod force-pushed the feature/fast_predict_enhancement branch from d61cb4a to c3a25b6 Compare August 2, 2023 14:22
@mulhod mulhod changed the title Draft: Enhancements to fast_predict Enhancements to fast_predict Aug 2, 2023
@mulhod
Copy link
Contributor Author

mulhod commented Aug 2, 2023

@desilinguist I think this is ready for the part you were going to handle (converting unit test models).

@mulhod mulhod marked this pull request as ready for review August 2, 2023 15:17
@codecov
Copy link

codecov bot commented Aug 2, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.06% 🎉

Comparison is base (2ed3733) 93.21% compared to head (5924248) 93.28%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #632      +/-   ##
==========================================
+ Coverage   93.21%   93.28%   +0.06%     
==========================================
  Files          33       33              
  Lines        4921     4972      +51     
==========================================
+ Hits         4587     4638      +51     
  Misses        334      334              
Files Changed Coverage Δ
rsmtool/modeler.py 97.64% <100.00%> (+0.05%) ⬆️
rsmtool/rsmpredict.py 99.35% <100.00%> (+0.17%) ⬆️
rsmtool/rsmtool.py 99.21% <100.00%> (+0.05%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mulhod mulhod force-pushed the feature/fast_predict_enhancement branch from 0fa4fa2 to d095c00 Compare August 2, 2023 17:15
Copy link
Collaborator

@desilinguist desilinguist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @mulhod! I had some suggestions/comments, mostly minor.

@mulhod
Copy link
Contributor Author

mulhod commented Aug 3, 2023

Thanks, @blongwill! I believe that I have addressed your concerns. Please take another look!

@mulhod mulhod requested a review from blongwill August 3, 2023 15:53
@desilinguist
Copy link
Collaborator

@mulhod I just updated all the relevant model files. I just realized we haven't yet looked at the documentation. I think there may be few places where the .model file may be described (the outputs generated by rsmtool). Can you please take a look and update all the relevant places?

@mulhod
Copy link
Contributor Author

mulhod commented Aug 3, 2023

@desilinguist I have done some grepping around in the documentation. I found one instance so far that needed to be changed. I'll continue digging a bit, though. My first iteration was looking specifically for pieces that touched on literally a .model file output. But, perhaps search for SKLL, etc., would be a good idea too.

Co-authored-by: Nitin Madnani <nmadnani@ets.org>
@desilinguist desilinguist self-requested a review August 3, 2023 19:26
Copy link
Collaborator

@desilinguist desilinguist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🎉

@blongwill
Copy link
Contributor

Thanks, @blongwill! I believe that I have addressed your concerns. Please take another look!

Looks great Matt! Thanks for making those updates!

@desilinguist desilinguist merged commit cdc97ae into main Aug 4, 2023
@delete-merged-branch delete-merged-branch bot deleted the feature/fast_predict_enhancement branch August 4, 2023 01:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance fast_predict by requiring fewer arguments (just the model)

4 participants