How Khan Academy is using Machine Learning to Assess Student Mastery

02 Nov 2011

See discussion on Hacker News and Reddit.


The Khan Academy is well known for its extensive library of over 2600 video lessons. It should also be known for its rapidly-growing set of now 225 exercises — outnumbering stitches on a baseball — with close to 2 million problems done each day.

To determine when a student has finished a certain exercise, we award proficiency to a user who has answered at least 10 problems in a row correctly — known as a streak. Proficiency manifests itself as a gold star, a green patch on teachers’ dashboards, a requirement for some badges (e.g. gain 3 proficiencies), and a bounty of “energy” points. Basically, it means we think you’ve mastered the concept and can move on in your quest to know everything.
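
For concreteness, the streak rule amounts to something like this minimal sketch (illustrative names, not our actual code):

STREAK_REQUIRED = 10

def has_streak_proficiency(answers):
    # answers: list of booleans, one per problem attempted, most recent last
    if len(answers) < STREAK_REQUIRED:
        return False
    return all(answers[-STREAK_REQUIRED:])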

It turns out that the streak model has serious flaws.

First, if we define proficiency as your chance of getting the next problem correct being above a certain threshold, then the streak becomes a poor binary classifier. Experiments conducted on our data showed a significant difference between students who take, say, 30 problems to get a streak vs. 10 problems right off the bat — the former group was much more likely to miss the next problem after a break than the latter.

False positives aren’t our only problem; we also have false negatives. One of our largest sources of complaints is frustrated students who lose their streak. You get 9 correct, make a silly typo, and lose all your hard-earned progress. In other words, the streak thinks that users who have gotten 9 right and 1 wrong are at the same level as those who haven’t started.

In Search of a Better Model

These findings, presented by Jace, one of our full-time volunteers, led us to investigate whether we could construct a better proficiency model. We prototyped a constant-acceleration “rocketship” model (with heavy gnomes that slow you down on wrong answers), but ultimately decided that a prudent first step would be to simply abstract the streak away behind the notion of “filling up a bar”.

We went from displaying the user’s current streak (bug not intended; could not find another screenshot):

Old streak bar with buggy number display

to this:

New progress bar empty

and when full:

New progress bar full

This gave us greater freedom to experiment with different underlying models without disrupting the interface.

Conversations with the team led me to conceive of applying machine learning to predict the likelihood of getting the next problem correct, and use that as the basis for a new proficiency model. Basically, if we think you’re more than $t$% likely to get the next problem correct, for some threshold $t$, we’ll say you’re proficient.

I started off by hacking together a naive Bayes binary classifier modified to give a probability estimate. I trained this on a few days’ worth of problem logs, and initial results were promising — the most striking being that fewer problems were needed to attain the same level of accuracy.

What do I mean by accuracy? We define it as

$P(\text{next problem correct} | \text{just gained proficiency})$

which is just notation desperately trying to say “Given that we just gained proficiency, what’s the probability of getting the next problem correct?”
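
As a rough illustration, here is how one might estimate that quantity from problem logs (a sketch only; the log format and field names are made up, not our actual schema):

def accuracy_after_proficiency(problem_logs):
    # problem_logs: (user, exercise, correct, gained_proficiency) tuples,
    # in chronological order.
    next_problem_outcomes = []
    awaiting_followup = set()

    for user, exercise, correct, gained_proficiency in problem_logs:
        key = (user, exercise)
        if key in awaiting_followup:
            next_problem_outcomes.append(correct)
            awaiting_followup.discard(key)
        if gained_proficiency:
            awaiting_followup.add(key)

    return float(sum(next_problem_outcomes)) / len(next_problem_outcomes)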

However, naive Bayes is typically used for classification — the task of determining which discrete category a data point belongs to — rather than for regression — returning a continuous value (in our case, a probability estimate in $[0,1]$).

So our full-time volunteer Jace, who is much better versed in statistics and machine learning, used R to quickly prototype and evaluate different machine learning algorithms and feature sets. R is the de facto programming language for statistical computing and comes pre-packaged with data analysis and machine learning tools.

To evaluate the different algorithms, input features, and thresholds, we came up with some metrics to gauge desirable characteristics:

  • Mean problems done to reach proficiency — ideally we’d like to minimize this so that students can spend less time rote-grinding on problems they know well, and move on to other concepts.
  • $P(\text{next problem correct} | \text{just gained proficiency})$ — Unfortunately, this is hard to correctly measure in our offline data set due to the streak-of-10 bias: students may loosen up after they gain proficiency and spend less time on subsequent problems.
  • Proficiency Rate — The percent of proficiencies attained per user-exercise pair. Again, this is hard to measure because of the streak bias.
  • Confusion matrix for predicted next problem correct — This is for comparing binary classifiers on their accuracy in predicting the outcome of any answer in a user’s response history. We build up the confusion matrix, and from that extract two valuable measures of the performance of a binary classifier (a rough sketch follows this list).
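
To make the last metric concrete, here is a sketch of building the confusion matrix from predicted vs. actual outcomes and deriving precision and recall, one standard pair of summary measures (the write-up above doesn’t pin down exactly which pair we reported):

def confusion_matrix(predicted_probabilities, actual_outcomes, threshold=0.5):
    # Classify "will get the next problem correct" when the predicted
    # probability clears the threshold, then tally the four cell counts.
    tp = fp = tn = fn = 0
    for p, actual in zip(predicted_probabilities, actual_outcomes):
        predicted = p >= threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision_recall(tp, fp, tn, fn):
    return float(tp) / (tp + fp), float(tp) / (tp + fn)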

We tested various models, including naive Bayes, support vector machines, a simple 10-out-of-last-11-correct model, and logistic regression. Based on the metrics above, we settled on…

Using Logistic Regression as a Proficiency Model

(Feel free to skip this section if you’re not technically inclined.)

Logistic regression is usually used as a classifier that gives a reasonable probability estimate of each category — exactly our requirement. It’s so simple, let’s derive it.

Let’s say we have the values of $n$ input features (e.g. percent correct), and we stuff them into a vector $\textbf{x}$. Let’s say we also happen to know how much each feature makes it more likely that the user is proficient, and stuff those weights into a vector $\textbf{w}$. We can then take the weighted sum of the input features, plus a pre-determined constant $w_0$ to correct for any constant bias, and call that $z$:

$z = w_0 + \sum_{i=1}^n w_ix_i$

Now if we set $x_0 = 1$, we can write that compactly as a linear algebra dot product:

$z = \mathbf{w}^T\mathbf{x}$

Already, you can see that the higher $z$ is, the more likely the user is to be proficient. To obtain our probability estimate, all we have to do is “shrink” $z$ into the interval $(0,1)$. We want negative values of $z$ to map into $(0, 0.5)$ and positive values to fall in $[0.5, 1)$. We can do this by plugging $z$ into a sigmoid function — in particular, we’ll use the logistic function:

$h(z) = \frac{1}{1+e^{-z}}$

And that’s it! $h(z)$ is the probability estimate that logistic regression spits out.
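
To get a feel for the numbers (illustrative values only, not weights from our trained model):

$h(0) = 0.5, \quad h(2) \approx 0.88, \quad h(2.75) \approx 0.94, \quad h(-2) \approx 0.12$

So mild positive evidence nudges the estimate just above a coin flip, while a high threshold in the mid-nineties is only cleared once the weighted evidence $z$ reaches roughly 2.75.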

The tricky bit is in determining the values of the weight vector $\textbf{w}$ — that is, training logistic regression so that $h$, aka. the hypothesis function in machine learning terminology, gives us a good probability estimate. For brevity I’ll spare you the details, but suffice to know that there are plenty of existing libraries to do that.
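
For example (purely as an illustration: we prototyped offline in R, and in production the trained weights simply live in params), a library like scikit-learn can fit $\textbf{w}$ from labelled response data:

from sklearn.linear_model import LogisticRegression

# One row per answered problem:
# [ewma_3, ewma_10, current_streak, log_num_done, log_num_missed, percent_correct]
features = [
    [1.0, 0.95, 4, 1.39, 0.00, 1.0],
    [0.3, 0.50, 0, 1.61, 1.10, 0.6],
    # ... many more rows extracted from problem logs ...
]
labels = [1, 0]  # was that problem answered correctly?

model = LogisticRegression()
model.fit(features, labels)

print model.intercept_  # plays the role of w_0 (params.INTERCEPT)
print model.coef_       # plays the role of the weight vector w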

So that raises the question, which features did we use?

  • ewma_3 and ewma_10 — Exponentially-weighted moving average. This is just math-talk for an average where we give greater weight to more recent values. It’s handy because it can be implemented recursively as $S_t = \alpha \times y + (1 - \alpha) \times S_{t-1}$, where $\alpha$ is the weighting factor, $y$ is the most recent value, and $S_{t-1}$ is the previous exponential moving average. We set $\alpha$ to 0.333 and 0.1 for ewma_3 and ewma_10 respectively.
  • current_streak — This turned out to be a rather weak signal and we’ll be discarding it in favour of other features in the future.
  • log_num_done — $\log(\text{number of problems done})$. We don’t try to predict until at least one problem has been done.
  • log_num_missed — $\log(\text{number of problems missed} + 1)$
  • percent_correct — $\frac{\text{number of problems correct}}{\text{number of problems done}}$

As for the proficiency threshold, we chose 94% based on our metrics.

Now for some Python code. To compute the exponentially-weighted moving average:

def exp_moving_avg(self, weight):
    # Start from a seed value and fold each answer in the response
    # history into the running average, one at a time.
    ewma = EWMA_SEED

    for i in reversed(xrange(self.total_done)):
        ewma = weight * self.get_answer_at(i) + (1 - weight) * ewma

    return ewma

and for the actual logistic regression hypothesis function:

class AccuracyModel(object):

    # ... snip ...

    def predict(self):
        """
        Returns: the probability of the next problem correct using logistic
            regression.
        """

        # We don't try to predict the first problem (no user-exercise history)
        if self.total_done == 0:
            return PROBABILITY_FIRST_PROBLEM_CORRECT

        # Get values for the feature vector X
        ewma_3 = self.exp_moving_avg(0.333)
        ewma_10 = self.exp_moving_avg(0.1)
        current_streak = self.streak()
        log_num_done = math.log(self.total_done)
        log_num_missed = math.log(self.total_done - self.total_correct() + 1)
        percent_correct = float(self.total_correct()) / self.total_done

        weighted_features = [
            (ewma_3, params.EWMA_3),
            (ewma_10, params.EWMA_10),
            (current_streak, params.CURRENT_STREAK),
            (log_num_done, params.LOG_NUM_DONE),
            (log_num_missed, params.LOG_NUM_MISSED),
            (percent_correct, params.PERCENT_CORRECT),
        ]

        X, weight_vector = zip(*weighted_features)  # unzip the list of pairs

        return AccuracyModel.logistic_regression_predict(
            params.INTERCEPT, weight_vector, X)

    # See http://en.wikipedia.org/wiki/Logistic_regression
    @staticmethod
    def logistic_regression_predict(intercept, weight_vector, X):
        # TODO(david): Use numpy's dot product fn when we support numpy
        dot_product = sum(itertools.imap(operator.mul, weight_vector, X))
        z = dot_product + intercept

        return 1.0 / (1.0 + math.exp(-z))

There’s another interesting problem here — how do you display that probability value on the progress bar? We try to linearize the display and distribute it evenly across the bar. Since it’s 4 am right now, I’ll just give you the code for it (it’s well-commented) and won’t make any helpful explanatory graphs (unless people request it ;)).

class InvFnExponentialNormalizer(object):
    """
    This is basically a function that takes an accuracy prediction (probability
    of next problem correct) and attempts to "evenly" distribute it in [0, 1]
    such that the progress bar appears to fill up linearly.

    The current algorithm is as follows:
    Let
        f(n) = probability of next problem correct after doing n problems,
        all of which are correct.
    Let
        g(x) = f^(-1)(x)
    that is, the inverse function of f. Since f is discrete but we want g to be
    continuous, unknown values in the domain of g will be approximated by using
    an exponential curve to fit the known values of g. Intuitively, g(x) is a
    function that takes your accuracy and returns how many problems correct in
    a row it would've taken to get to that, as a real number. Thus, our
    progress display function is just
        h(x) = g(x) / g(consts.PROFICIENCY_ACCURACY_THRESHOLD)
    clamped between [0, 1].

    The rationale behind this is that if you don't get any problems wrong, your
    progress bar will increment by about the same amount each time and be full
    right when you're proficient (i.e. reach the required accuracy threshold).
    """

    def __init__(self, accuracy_model, proficiency_threshold):
        X, Y = [], []

        for i in itertools.count(1):
            accuracy_model.update(correct=True)
            probability = accuracy_model.predict()

            X.append(probability)
            Y.append(i)

            if probability >= proficiency_threshold:
                break

        self.A, self.B = exponential_fit(X, Y)
        # normalize the function output so that it outputs 1.0 at the
        # proficiency threshold
        self.A /= self.exponential_estimate(proficiency_threshold)

    def exponential_estimate(self, x):
        return self.A * math.exp(self.B * x)

    def normalize(self, p_val):
        # TODO(david): Use numpy clip
        def clamp(value, minval, maxval):
            return sorted((minval, value, maxval))[1]

        return clamp(self.exponential_estimate(p_val), 0.0, 1.0)

Now, until Google App Engine supports NumPy, the implementation of exponential_fit is just a hand-rolled least-squares fit: take the derivative of the squared-error cost, set it to zero, and solve for the coefficients.
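
For the curious, a sketch of what that amounts to: fit $y = Ae^{Bx}$ by taking logs (so $\ln y = \ln A + Bx$) and solving the resulting linear least-squares problem in closed form. The code in our repo may differ in the details.

import math

def exponential_fit(X, Y):
    """Fit Y = A * exp(B * X) by ordinary least squares on log(Y)."""
    n = len(X)
    log_y = [math.log(y) for y in Y]

    sum_x = sum(X)
    sum_ly = sum(log_y)
    sum_x_ly = sum(x * ly for x, ly in zip(X, log_y))
    sum_x_sq = sum(x * x for x in X)

    # Setting the derivatives of the squared-error cost to zero gives the
    # familiar closed-form solution for simple linear regression.
    B = (n * sum_x_ly - sum_x * sum_ly) / (n * sum_x_sq - sum_x ** 2)
    A = math.exp((sum_ly - B * sum_x) / n)

    return A, B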

The full, uncut, unaltered code is available at our Kiln repo.

Showdown: Logistic Regression vs. The Streak

The metrics may tell us that logistic regression wins, but being the illogical, squishy human beings that we are, we yearned for an intuitive understanding of the unique behaviour of the different models. I developed an internal tool to simulate exercise responses and visualise the prediction of different models. Here’s a highlight reel of the salient characteristics.

As expected, order matters. Both models weigh problems done more recently more heavily than earlier problems. What may be surprising is the relative importance:

Position Matters

Logistic regression seems to care much less about position than the streak does.

Both models monotonically increase confidence the more responses of the same type they receive:

More positives increase likelihood of correct response

Logistic regression also recognises consistently spotty performance:

Longer alternating right-wrongs makes logistic regression sadder

Logistic regression takes into account prior performance. So, getting lots correct is always a good thing, and you’ll be able to recover faster from a wrong answer if you were previously doing well. Contrast with the streak model, which loses all memory after a single incorrect answer.

Initial correct answers allow for an easier recovery

This could also work against you. If you’ve gotten lots of wrong answers, you’ll need to do more work to convince logistic regression that you’re actually competent. This mitigates one of the issues we had with the streak, where we found that there was a significant difference in actual proficiency for those getting a streak immediately vs. after 30 problems.

Multiple wrong answers will count

Could this be overly harsh for struggling students? That’s a question we are actively investigating, and as a stopgap measure we only keep the last 20 problems as history. This compromise has an insignificant effect on logistic regression’s predictive accuracy, but it lets us sleep knowing that a student will not be damned for life if they were doing some unusual exploration and got 10 problems in a row wrong.
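
(Mechanically, that stopgap is just a matter of truncating the response history before computing features, along these lines; the exact details in our code may differ.)

MAX_HISTORY = 20

def truncate_history(answers):
    # Keep only the most recent MAX_HISTORY responses for feature computation.
    return answers[-MAX_HISTORY:]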

Since it’s 4 am, I don’t have an interactive demo on this page, but it wouldn’t be hard to add one. If you would like to play with this, please say so.

Results

This was a fairly large change that we, understandably, only wanted to deploy to a small subset of users. That was facilitated by Bengineer Kamens’ GAE/Bingo split-testing framework for App Engine. Crucially, it allowed us to measure conversions as a way of gathering more accurate statistics on actual usage data.

The experiment has been running for 6 days thus far with 10% of users using the new logistic regression proficiency model. Before I reveal anything else, see a screenshot of GAE/Bingo in action (from a few hours ago):

Proficiency gained all - 228.97% (streak) vs 268.42% (accuracy)

The graph above shows the results over time, so you can see when trends have stabilised.

Now, for what you’ve been waiting for: our current statistics (5 am PST, Nov. 2) show that, for the new model, we have:

  • 20.8% more proficiencies earned
  • 15.7% more new exercises attempted
  • 4.4 fewer problems done (26% fewer) per proficiency
  • Essentially the same accuracy at proficiency
  • Higher accuracy attained among a set of 3 pre-chosen easy problems. Jace came up with this statistic to gauge any actual differences in learning. The basic idea is: If accuracy as determined by logistic regression is a good approximation of competence, then higher attained accuracies would be indicative of greater competence. Note the precipitous drop at 94% for the accuracy model — this is due to the proficiency threshold set at 94% for logistic regression, so once a user reaches that level, we tell them to move on. (A streak of 10 with no wrong answers nets an accuracy of 96.7%.)
  • Slightly higher accuracy attained for a set of 10 pre-chosen hard problems. Going above and beyond the call of duty seems much less popular here, among accuracy model participants.
  • P(do another problem | just answered incorrectly) not affected
  • 11.7% more proficiencies earned for the hard problems
  • 14.8% more proficiencies earned for the easy problems

In high level terms, we increased overall interest — more new exercises attempted, fewer problems done per proficiency — without lowering the bar for proficiency — P(next problem correct | just gained proficiency) was roughly the same for both groups. Further, it seemed that overall learning, as measured by the distribution of accuracies obtained, went up slightly under the new model.

Optimistically, we hypothesise that our gains come from moving students more quickly off exercises they’re good at, while having them spend more time on concepts where they need more practice. To confirm or deny this…

In the Pipeline

…we will look into exactly where the new proficiencies are coming from. We are also interested in seeing if there is any variation in knowledge retention — in particular, we want to know if P(next problem correct | took a few days break) is affected.

This is just the end of the beginning for us. We wish to investigate and possibly implement:

  • Stochastic gradient descent for online learning of logistic regression (a rough sketch of what a single online update might look like follows this list)

  • …which would allow adaptive models per user and per exercise. Should we bump up the proficiency threshold for users who find the exercises too easy?

  • On a similar note, could we define a fitness function that takes into account both accuracy and student frustration, and find the optimal time to tell the student to move on? Could this allow us to maximize student learning by maximizing accuracy across many exercises?

  • Model improvements. Here are some things we still need to try:

    • Incorporate more features, such as time spent per problem, time since last problem done, and user performance on similar exercises.
    • Experiment with non-linear feature transformations and combinations, e.g. $\frac{\text{ewma}}{\log(1-\text{ewma})}$
    • Along with the above, apply regularization to prevent overfitting (thanks Andrew Ng and ml-class!)
    • Train and use separate models for the first 5 problems vs. those after that.
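
To give a flavour of the first item above, here is a rough sketch of a single online logistic-regression update (the function name and learning rate are illustrative; nothing like this is running in production yet):

import math

def sgd_update(weights, intercept, x, y, learning_rate=0.1):
    # One online update of logistic regression from a single response.
    # x: feature vector for that response; y: 1 if it was correct, else 0.
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    prediction = 1.0 / (1.0 + math.exp(-z))
    error = y - prediction

    # Gradient ascent on the log-likelihood, one example at a time.
    new_weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    new_intercept = intercept + learning_rate * error

    return new_weights, new_intercept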

This work to create an accurate predictor has many applications beyond just powering the proficiency meter:

  • Determine if the user is struggling, and if so suggest watching a video, using some hints, or taking a break.
  • Determine the optimal date to schedule a review for spaced repetition learning.
  • Tailor a custom mixed-question review session that addresses weak areas.

Stay tuned for a continuation blog post if we find more interesting results!

Obligatory Recruiting Plug

Think you can do better? Well, I agree! I’m sure you know lots of ways to improve what we did. Good news: we’re open-source and hiring!

We welcome contributors to our exercises and exercise framework on GitHub. Some of our best exercises were created by volunteers: check out this awesome derivative intuition exercise created by Bengineer Eater.

Another reason I love working at the Khan Academy is the passionate and talented team. Our lead developer, Bengineer Kamens, is committed to our productivity and well-being. He Bengineers internal refactorings and tools, and spends much of his time getting new developers up to speed. Without his Bengineering, it would not have been possible to collect all this interesting data. Also, if you ever have a question about jQuery, you could just ask John Resig here.

Do you want to make 0.1% improvements in ad click-thru rates for the rest of your life, or come with us and change the world of education?

Also, if you were wondering, we are not based in the UK, Canada, or Australia… my Canadian heritage compels me to spell “-our” and “-ise” when it’s not code. :P


Update (Nov 3, 2 am PST)

Thank you all for your suggestions and feedback!

There’s some interesting discussion on Hacker News and Reddit.

Update (Nov 12)

Having run the experiment for more than two weeks now, we analysed 10 days’ worth of data to see if longer-term knowledge retention was affected. It turns out that students are slightly more likely to answer correctly after taking a break under the new model:

These results are encouraging. They show that the new model addresses one of the core problems with the streak — the variability of student success rates after taking a break — while at the same time increasing proficiency rates. Thus, we have reason to conclude that the accuracy model is simply a better model of student proficiency.

This information gave us the confidence to roll out from 10% to 100% of users. We have now officially launched the logistic regression proficiency model site-wide!