Chapter 2 NASA SVM [ML|PY]

Acronym disambiguation with machine learning

A personal project using Python and a public NASA dataset of acronyms drawn from NASA white paper abstracts.

My source of motivation comes from the suggestion in the dataset description:

This was found to be a suitable dataset for training disambiguation models that use the context of the surrounding sentences to predict the correct meaning of the acronym. The prototype machine-learning models created from this dataset have not been released.

So I decided to make my own prototype model and was inspired by an acronym disambiguation white paper from Courant Institute, NYU (Turtel and Shasha 2007).

2.1 Description

Goal: Use an SVM and statistical analysis (of words/context) to properly classify ambiguous acronym definitions based on provided data.

  • Acronym disambiguation is the process of determining the correct expansion/definition of an acronym in a given context
    • These specific acronyms have multiple definitions, making them ambiguous when undefined
  • Ambiguous sentences contain undefined acronyms, whereas unambiguous sentences contain both the acronyms and their expansions

Example:

In the given NASA abstracts, the acronym IMF is used with two separate definitions:

  1. Interplanetary Magnetic Field (32 instances)
  2. Intrinsic Mode Functions (14 instances)

Given an ambiguous sentence containing an ambiguous acronym:

Expressed in the IMFs, they have well-behaved Hilbert Transforms from which instantaneous frequencies can be calculated.

The algorithm guesses what IMF stands for (in this case, Intrinsic Mode Functions) based on three different contexts: ambiguous, unambiguous, and total (the whole abstract).

2.1.1 Data

Each line in processed_acronyms.jsonl is an acronym found to have more than one definition; there are 484.
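
As a minimal sketch (assuming the field names acronym, definition, and corpus_positions used by the full script at the end of this chapter), the entries recorded for a single acronym can be collected like so:

import json

# Collect every definition entry recorded for one acronym; the field names
# match those relied on by nasa_svm.py (see the Source Code section).
def definitions_for(acronym, path='nasa-svm_data/processed/processed_acronyms.jsonl'):
    defs = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if entry['acronym'] == acronym:
                defs[entry['definition']] = entry['corpus_positions']
    return defs

# e.g. definitions_for('IMF')
## {'Interplanetary Magnetic Field': [...], 'Intrinsic Mode Functions': [...]}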

However, judging by the dataset’s description, this isn’t the highest-quality data: what counts as a “proper” alternate definition/expansion is loose (see the Future Work section), and a couple of the results are wonky as well (see below).

For example, “TOMS” isn’t really an ambiguous acronym:

  1. Total Ozone Mapping Spectrometer
  2. Ozone Mapping Spectrometer (

This algorithm works for any legitimate ambiguous acronym within the dataset.

Example usage:

#!/usr/bin/env bash
python3 nasa_svm.py <acronym> [--v]

I could’ve tailored the algorithm to sift through – then run on – every legitimate acronym, but for brevity’s sake I selected 10 acronyms with 2 definitions, 5 acronyms with 3 definitions, and 1 acronym with 4 definitions (roughly paralleling the distribution):

Acronyms

1. "US": 'United States', 'Upper Stage'

2. "SST": 'Sea Surface Temperature', 'Shear Stress Transport'

3. "IMF": 'Interplanetary Magnetic Field', 'Intrinsic Mode Functions'

4. "VMS": 'Vertical Motion Simulator', 'Visual Motion Simulator'

5. "RMS": 'Remote Manipulator System', 'Root Mean Square'

6. "DOE": 'Department of Energy', 'Design of Experiments'

7. "NAS": 'National Airspace System', 'Numerical Aerodynamic Simulation'

8. "LET": 'Linear Energy Transfer', 'Link Evaluation Terminal'

9. "MLS": 'Microwave Limb Sounder', 'Microwave Landing System'

10. "RCS": 'Reaction Control System', 'Radar Cross Section'

11. "ISO": 'Infrared Space Observatory', 'International Standards Organization', 'Imaging Spectrometric Observatory'

12. "CM": 'Crew Module', 'Command Module', 'Configuration Management'

13. "PEM": 'Pressurized Excursion Module', 'Proton Exchange Membrane', 'Pacific Exploratory Mission'

14. "CRM": 'Common Research Model', 'Cockpit Resource Management', 'Crew Resource Management'

15. "LCC": 'Launch Control Center', 'Launch Commit Criteria', 'Life Cycle Cost'

16. "ATM": 'Air Traffic Management', 'Asynchronous Transfer Mode', 'Apollo Telescope Mount', 'Airborne Topographic Mapper'

2.2 Algorithm

Skeleton:

INPUT: Acronym from NASA abstract

Filter all relevant abstracts
FOR each acronym definition:
    Find and extract all sentences containing acronym
    Separate into ambiguous and unambiguous
    Remove both the definitions and acronyms
END
Randomly sample test sentences containing ambiguous acronym
    Extract and remove these from training set
Build feature vector of surrounding meaningful words for context
    Concat sentences for each definition as the n documents for tf-idf
Train multi-class linear SVCs to yield word frequency coefficients
    Three models: ambiguous, unambiguous, and total contexts
FOR each test sentence:
    Create feature vector and feed into model for prediction
    Grade predictions
END

OUTPUT: results in csv format for analysis
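
One detail worth spelling out from the skeleton: the training sentences gathered for each definition are concatenated into a single document, so tf-idf sees exactly one document per definition. A minimal sketch of that step, assuming a hypothetical sentences_by_definition mapping (the actual script does this bookkeeping through the slices lists and the preprocess()/tf_idf() helpers in the source listing):

from sklearn.feature_extraction.text import TfidfVectorizer

# One concatenated document per definition, then tf-idf over those n documents
def fit_tfidf(sentences_by_definition):
    # sentences_by_definition: hypothetical {definition: [sentence, ...]} map
    names = list(sentences_by_definition)
    docs = [' '.join(sentences_by_definition[name]) for name in names]
    vectorizer = TfidfVectorizer(stop_words='english',
                                 token_pattern=r'(?u)\b[A-Za-z]+\b')
    matrix = vectorizer.fit_transform(docs)  # shape: (n_definitions, vocabulary size)
    return names, matrix, vectorizer.vocabulary_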

CSV Output:

The acronym along with the accuracies for the ambiguous, unambiguous, and total (combined) contexts.

python3 nasa_svm.py DOE
## DOE 0.9491525423728814 0.9661016949152542 0.9661016949152542

Accuracy is used as the scoring metric because, for single-label multi-class predictions like these, it is equivalent to the micro-averaged F-score, even under the class imbalance present in the data (see Analysis).
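
A quick toy check of that equivalence (the labels below are made up purely for illustration):

from sklearn.metrics import accuracy_score, f1_score

# For single-label multi-class predictions, micro-averaged F1 reduces to accuracy
y_true = ['Interplanetary Magnetic Field'] * 3 + ['Intrinsic Mode Functions'] * 2
y_pred = ['Interplanetary Magnetic Field'] * 2 + ['Intrinsic Mode Functions'] * 3

print(round(accuracy_score(y_true, y_pred), 3))
## 0.8
print(round(f1_score(y_true, y_pred, average='micro'), 3))
## 0.8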

Verbose Output:

Provides more insight such as the number of training examples for each definition, the confusion matrices, and F1-scores for each model (to show performance in each class).

time python3 nasa_svm.py ISO --v
## -------------------------
## Test set: (n = 30)
## Infrared Space Observatory: 23
## International Standards Organization: 2
## Imaging Spectrometric Observatory: 5
## 
## Ambiguous n: 120      slices: [93, 15, 13]
## Unambiguous n: 105    slices: [80, 14, 12]
## Combined n: 712   slices: [544, 99, 70]
## 
## Guess (ambiguous) MCM & F1:
## [[[ 5  2]
##   [ 3 20]]
## 
##  [[28  0]
##   [ 1  1]]
## 
##  [[21  4]
##   [ 2  3]]]
## Infrared Space Observatory: 0.888888888888889
## International Standards Organization: 0.6666666666666666
## Imaging Spectrometric Observatory: 0.5
## 
## Guess (unambiguous) MCM & F1:
## [[[ 6  1]
##   [ 4 19]]
## 
##  [[27  1]
##   [ 0  2]]
## 
##  [[21  4]
##   [ 2  3]]]
## Infrared Space Observatory: 0.8837209302325583
## International Standards Organization: 0.8
## Imaging Spectrometric Observatory: 0.5
## 
## Guess (combined) MCM & F1:
## [[[ 2  5]
##   [ 1 22]]
## 
##  [[28  0]
##   [ 1  1]]
## 
##  [[24  1]
##   [ 4  1]]]
## Infrared Space Observatory: 0.8800000000000001
## International Standards Organization: 0.6666666666666666
## Imaging Spectrometric Observatory: 0.28571428571428575
## 
## Accuracy (ambiguous): 0.8
## Accuracy (unambiguous): 0.8
## Accuracy (combined): 0.8
## 
## real 0m4.267s
## user 0m3.655s
## sys  0m1.080s

A bag-of-words model is built for the test samples, one per context vocabulary, so that feature vectors aligned with each model’s training features can be created for classification.
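
A minimal sketch of that step, mirroring the bow() helper in the source listing; vocab stands for the vocabulary_ dict returned by the tf-idf step for one of the contexts:

from sklearn.feature_extraction.text import CountVectorizer

# Count test sentences against a *fixed* vocabulary so the resulting feature
# vectors line up column-for-column with the corresponding model's features
def bow_features(test_sentences, vocab):
    vectorizer = CountVectorizer(stop_words='english',
                                 token_pattern=r'(?u)\b[A-Za-z]+\b',
                                 vocabulary=vocab)
    return vectorizer.fit_transform(test_sentences)  # shape: (n_tests, len(vocab))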

The SVM is implemented with scikit-learn as a support vector classifier (SVC) using a linear kernel.
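
Concretely, the classifier is constructed as in the source listing; the toy rows below merely stand in for the single tf-idf document each definition contributes:

import numpy as np
from sklearn.svm import SVC

# Linear-kernel SVC with C = 1 and a one-vs-one decision function shape
X = np.array([[0.9, 0.1, 0.0],   # stand-in tf-idf row for definition A
              [0.0, 0.2, 0.8]])  # stand-in tf-idf row for definition B
y = ['Definition A', 'Definition B']

model = SVC(C=1., kernel='linear', decision_function_shape='ovo')
model.fit(X, y)
print(model.predict([[0.8, 0.1, 0.1]]))
## ['Definition A']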

Model parameters (e.g., C) were taken from the aforementioned white paper, so fine-tuning via cross-validation is not included here; it is proposed in the TODO section as a future improvement.

The SVC uses a one-vs-one (OVO) decision function shape (see under the hood); in the binary case, the prediction always comes down to the signed distance from the separating hyperplane.

  • I.e., how “deep” the data point is in a specific class’ area

This also contributes to why the algorithm works so well for 2 definitions but poorly for 3-4 (see Conclusions).

2.3 Analysis

Given the slices shown in the verbose output, almost every acronym has a dominant definition that appears 2-5 times more often than the others.

Since the test sentences are randomly sampled, I made sure that each definition is included for every test set.
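
A minimal sketch of that guard, assuming a hypothetical labels list parallel to the ambiguous sentences (the full script works through the slices bookkeeping instead, and additionally requires at least two test sentences per definition):

import random

# Keep redrawing a ~20% test sample until every definition is represented
def sample_test_indices(labels, frac=0.2):
    while True:
        idx = random.sample(range(len(labels)), round(len(labels) * frac))
        if {labels[i] for i in idx} == set(labels):
            return idx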

This class imbalance, combined with the fairly small training sets, causes trouble when the model tries to classify acronyms with more than 2 definitions.

I also added a classifier trained with all the sentences in the abstracts for each acronym as a combined context to see if stripping “noise” helped. My guess is that since the training sets are small, every bit of text matters.

  • Collecting only the surrounding words in sentences containing acronyms (as opposed to using the whole abstract for example) should work best with a larger training set

2.3.1 Scripts

For example, running the script 5 times per acronym and appending the results to CSV files:

#!/bin/zsh
for acronym in US SST IMF VMS RMS DOE NAS LET MLS RCS ISO CM PEM CRM LCC ATM
do
  printf 'Writing to ./nasa-svm_data/results/output_%s.csv\n' "$acronym"
  for i in {1..5}
  do
    python3 nasa_svm.py "$acronym" >> nasa-svm_data/results/output_${acronym}.csv
  done
done

Quick analysis using Python:

import csv
import pandas as pd
from glob import glob

file_list = glob("nasa-svm_data/results/*.csv")

# Read each space-delimited results file and stack them into one DataFrame
frames = []
for file in file_list:
  with open(file, newline='') as csvfile:
    frames.append(pd.DataFrame(csv.reader(csvfile, delimiter=' ', quotechar='|'),
                               columns=['acronym', 'acc_amb', 'acc_unamb', 'acc_comb']))
grades = pd.concat(frames, axis=0)

grades = grades.astype({'acc_amb':float, 'acc_unamb':float, 'acc_comb':float})

# Take grouped averages and sort by ambiguous
avgs = grades.groupby(['acronym'])[['acc_amb', 'acc_unamb', 'acc_comb']].mean()
avgs = avgs.sort_values(['acc_amb'], ascending=False)

# Average accuracies (33 randomly sampled test groups)
print(avgs)
##           acc_amb  acc_unamb  acc_comb
## acronym                               
## IMF      0.990676   0.904429  0.990676
## MLS      0.982290   0.974813  0.979142
## RCS      0.976874   0.974482  0.990431
## US       0.964912   0.974482  0.998405
## LET      0.964349   0.941176  0.998217
## DOE      0.963020   0.954802  0.955316
## SST      0.939394   0.909091  0.944056
## NAS      0.926815   0.913665  0.923957
## RMS      0.921212   0.943434  0.929293
## LCC      0.746212   0.645833  0.700758
## ISO      0.744444   0.762626  0.795960
## VMS      0.741414   0.888889  0.751515
## PEM      0.609626   0.597148  0.661319
## CM       0.608586   0.624579  0.693603
## CRM      0.582888   0.545455  0.549020
## ATM      0.449811   0.462121  0.482008
# Average accuracies, top 10 (the 2-definition acronyms, except LCC edging out VMS)
print(avgs.head(10)[['acc_amb', 'acc_unamb', 'acc_comb']].mean())
## acc_amb      0.937576
## acc_unamb    0.913621
## acc_comb     0.941025
## dtype: float64
# Average accuracies, bottom 5 (mostly 3-4 definitions, plus VMS)
print(avgs.tail(5)[['acc_amb', 'acc_unamb', 'acc_comb']].mean())
## acc_amb      0.598465
## acc_unamb    0.623638
## acc_comb     0.627493
## dtype: float64

2.4 Conclusions

Training the model on entire abstracts provided a marginal increase in accuracy over immediate contexts despite greatly increasing the vocabulary, which hints that most of it is noise. This also makes sense intuitively, given that NASA white paper abstracts are similar in structure.

For 2 definitions this algorithm performs quite well given sparse vocabularies formed from small data.

For more than 2 definitions we run into some problems, the main ones being small training sets and imbalanced class distributions.

There aren’t enough instances to properly train the models, but this is out of our hands. Simply put, the models need more examples of immediate context to capture the greater complexity presented in acronyms with 3-4 definitions – which in turn would provide better hyperplanes to separate the definitions.

Also, given an OVO decision function, multi-class classification is treated as several related binary classification tasks:

It’s harder to distinguish the remaining 2-3 “lesser-known” definitions from the “main/common” definition, which has the highest frequency; the common definition “encroaches,” which tends toward more false negatives for the common definition and more false negatives/positives among the lesser-known ones.

2.4.1 TODO

Results could be improved by implementing an unbiased classifier, perhaps with a one-vs-all/one-vs-rest (OVA/OVR) decision function (as opposed to the OVO used earlier).

By default, the model assigns every class a weight of 1, which implicitly assumes the data is evenly distributed between the classes.

For example, with an acronym that has 3 definitions:

  1. There is an approximate class label ratio of ~3:1:1, but some are different (e.g., ~1:1:3)
  2. Using the class_weight='balanced' hyperparameter, the SVC reweights each class inversely to its frequency (n_samples / (n_classes * count)), down-weighting the “common class” (e.g., a 2:1:1 split gives roughly [0.67, 1.33, 1.33]); see the sketch after this list
    • Would have to compare with manually calculated class weights based on each distribution (e.g., 2:1:1 gives [1, 2, 2])
  3. Perform comparative metrics for OVO vs OVR given acronyms with 3-4 definitions
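
A quick sketch of both weighting schemes for a hypothetical 2:1:1 split, using scikit-learn’s compute_class_weight helper:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 2:1:1 label split
y = ['common'] * 2 + ['rare1'] + ['rare2']

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * count)
print(np.round(compute_class_weight('balanced', classes=np.unique(y), y=y), 2))
## [0.67 1.33 1.33]

# Manually calculated inverse-frequency weights relative to the common class
counts = {c: y.count(c) for c in sorted(set(y))}
print({c: max(counts.values()) / n for c, n in counts.items()})
## {'common': 1.0, 'rare1': 2.0, 'rare2': 2.0}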

2.5 Future Work?

There are a few hundred other acronyms available to play with. The following acronyms could serve as inspiration for another (more complex) project – perhaps for:

  • Learning grammar for either correction or more nuanced guesses (see CFD, TRMM, MSS, TES)
  • Developing some similarity-based merging of classifications (see AVIRIS, NASA, JPL, CCD)
  • Maybe even a combo of the aforementioned (see GEO)

List of Acronyms - Nuanced Definitions:

Including but not limited to:

1. "CFD": 'Computational Fluid Dynamics', 'Computational fluid dynamics'

2. "TRMM": 'Tropical Rainfall Measuring Mission', 'Tropical Rainfall Measurement Mission', 'Tropical Rain Measuring Mission'

3. "MSS": 'Mobile Satellite Service', 'Mobile Servicing System'

4. "AVIRIS": 'Airborne Visible/Infrared Imaging Spectrometer', 'Airborne Visible and Infrared Imaging Spectrometer'

5. "TES": 'Thermal Emission Spectrometer', 'Tropospheric Emission Spectrometer'

6. "NASA": 'National Aeronautics & Space Administration', 'National Aeronautic and Space Administration'

7. "JPL": 'Jet Propulsion Laboratory', 'Jet Propulsion Lab'

8. "CCD": 'Charge Coupled Device', 'Charge Coupled Devices'

9. "GEO": 'Geosynchronous Earth Orbit', 'geosynchronous Earth orbit', 'Group on Earth Observations', 'Geostationary Earth Orbit', 'geostationary Earth orbit'

Useful Metadata:

Each acronym definition also has “NASA terms” attached to it; for example, IMF comes with some “additional context” that can be utilized (see the sketch after the list below).

1. Interplanetary Magnetic Field: 'SOLAR MAGNETIC FIELD', 'SOLAR WIND', 'INTERPLANETARY MAGNETIC FIELDS', 'MAGNETIC FLUX', 'MAGNETIC PROBES', 'SPACE PROBES'

2. Intrinsic Mode Functions: 'HILBERT TRANSFORMATION', 'SPECTRAL EMISSION', 'DECOMPOSITION', 'SPECTRUM ANALYSIS', 'TIME FUNCTIONS', 'FREQUENCY DISTRIBUTION', 'NONLINEARITY', 'TIME SERIES ANALYSIS', 'NONLINEAR SYSTEMS', 'NONLINEAR EQUATIONS'
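
A minimal sketch of folding these terms into a definition’s training document, along the lines of the commented-out terms line in the source listing (the function and argument names here are hypothetical):

# Append a definition's NASA terms to its concatenated training document so
# the tf-idf vocabulary also picks up this curated context
def with_nasa_terms(definition_doc, nasa_terms):
    # definition_doc: concatenated training sentences for one definition
    # nasa_terms: e.g. ['SOLAR MAGNETIC FIELD', 'SOLAR WIND', ...]
    return definition_doc + ' ' + ' '.join(nasa_terms)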

Source Code

The full source code for this project (nasa_svm.py) is listed below:


import json
import re
import nltk
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, f1_score
from sklearn.svm import SVC
from functools import reduce
import random
import operator
import sys
#nltk.download('punkt')
#nltk.download('stopwords')

def preprocess(sentences, slices, names = None, tfidf = True, context = None):
    if context is None: context = []
    for i, s in enumerate(slices):
        if i == 0:
            context.append(' '.join(sentences[:s]))
        else:
            context.append(' '.join(sentences[slices[i-1]:(s + slices[i-1])]))
    context = cleaner(context)
    if tfidf:
        if names is None:
            vectors, vocab = tf_idf(context)
        else:
            vectors, vocab = tf_idf(context, names)
        return vectors, vocab
    else:
        return context

def cleaner(context):
    context = [re.sub(r'\w*\d\w*', '', w) for w in context]
    context = [re.sub(r'[^A-Za-z0-9 ]+', '', w) for w in context]
    context = [re.sub(r'\s+', ' ', w) for w in context]
    return context

def filtered(sentence, keywords):
    for k in keywords:
        sentence = sentence.replace(k, '')
    return sentence

def tf_idf(context, names = None):
    vectorizer = TfidfVectorizer(stop_words='english',
                                 token_pattern=r'(?u)\b[A-Za-z]+\b')
    vector = vectorizer.fit_transform(context)
    vocab = vectorizer.vocabulary_
    tokens = vectorizer.get_feature_names_out()
    df_tfidf = pd.DataFrame(data=vector.toarray(), index=names, columns=tokens)
    return df_tfidf, vocab

def bow(context, names = None, vocab = None):
    vectorizer = CountVectorizer(stop_words='english',
                                 token_pattern=r'(?u)\b[A-Za-z]+\b',
                                 vocabulary=vocab)
    vector = vectorizer.fit_transform(context)
    tokens = vectorizer.get_feature_names_out()
    df_bow = pd.DataFrame(data=vector.toarray(), index=names, columns=tokens)
    return df_bow

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
acronym_list = ['US','SST','IMF','VMS','RMS','DOE','NAS','LET',
                'MLS','RCS','ISO','CM','PEM','CRM','LCC','ATM']

with open('nasa-svm_data/processed/processed_acronyms.jsonl', 'r') as json_acronyms, open('nasa-svm_data/raw/results_merged.jsonl', 'r') as json_corpus:

    if len(sys.argv) > 1:
        if sys.argv[1] in acronym_list:
            acronym = sys.argv[1]
        else:
            print("Example Usage: python3 nasa_svm.py <acronym> [--v]")
            print(f'Acronyms: {acronym_list}')
            exit()
    else:
        acronym = random.choice(acronym_list)

    json_list = list(json_acronyms)
    json_list_corpus = list(json_corpus)
    ambiguous, sentences, combined, names, terms = [], [], [], [], []
    slices, slices_amb, slices_cmb, slices_amb_ind = [], [], [], []
    guess, guess_amb, guess_cmb, grades = [], [], [], []

    for json_str in json_list:
        result = json.loads(json_str)
        if result['acronym'] == acronym:
            keywords = [result['definition'] + " (" + result['acronym'] + ")",
                        result['acronym'] + " (" + result['definition'] + ")",
                        result['acronym'], result['definition']]
            names.append(result['definition'])
            # For all the abstracts where acronym variant is found
            for index in result['corpus_positions']:
                json_abstract = json_list_corpus[index]
                # Take abstract, split each into sentences
                abstract = json.loads(json_abstract)['description']
                parsed = tokenizer.tokenize(str(abstract))
                # Filter sentences and feed into split bucket
                for sentence in parsed:
                    # All of the sentences with keywords removed
                    combined.append(sentence)
                    combined[-1] = filtered(combined[-1], keywords)
                    if sentence.find(keywords[0]) != -1 or sentence.find(keywords[1]) != -1:
                        sentences.append(sentence)
                        # Remove acronyms and their definitions
                        sentences[-1] = filtered(sentences[-1], keywords)
                        # Change 'if' below back to elif to have every sentence containing
                        # both definition/acronym pair AND undefined acronyms
                        # considered as only part of [sentences] (NOT also ambiguous)
                        # also change sentences[-1] below back to sentence
                        if sentence.find(keywords[2]) != -1:
                            ambiguous.append(sentences[-1])
                            ambiguous[-1] = filtered(ambiguous[-1], keywords)
                            # Ambiguous slices w.r.t. definitions
                            slices_amb_ind.append(index)
                    elif sentence.find(keywords[2]) != -1:
                        ambiguous.append(sentence)
                        ambiguous[-1] = filtered(ambiguous[-1], keywords)
                        # Ambiguous slices w.r.t. definitions
                        slices_amb_ind.append(index)
            # Slices based on which acronym sentence belongs to
            if len(slices_amb) == 0:
                slices_amb.append(len(slices_amb_ind))
                slices.append(len(sentences))
                slices_cmb.append(len(combined))
            elif len(slices_amb) == 1:
                slices_amb.append(len(slices_amb_ind) - slices_amb[0] + 1)
                slices.append(len(sentences) - slices[0] + 1)
                slices_cmb.append(len(combined) - slices_cmb[0] + 1)
            else:
                # More than 2 definitions
                slices_amb.append(len(slices_amb_ind) - reduce(operator.add, slices_amb) + 1)
                slices.append(len(sentences) - reduce(operator.add, slices) + 1)
                slices_cmb.append(len(combined) - reduce(operator.add, slices_cmb) + 1)
            # Optional additional contextual terms for each definition
            #terms.append(' '.join(map(str, cleaner(json.loads(json_abstract)['subject.NASATerms']))))

    # Ensure at least 2 samples from each definition are extracted
    good_batch = False
    while not good_batch:
        testing_set = random.sample(ambiguous, round(len(ambiguous)/5))
        testing_ind, key = [], []
        # Determine which acronym the random sample belongs to
        for t in testing_set:
            if t in ambiguous:
                testing_ind.append(ambiguous.index(t))
                for count, i in enumerate(slices_amb):
                    start = len(ambiguous) - sum(slices_amb[count:], -1)
                    if ambiguous.index(t) in range(start, i+start):
                        key.append(names[count])
        key_counts = {i:key.count(i) for i in names}
        good_count = {k:v for (k,v) in key_counts.items() if v > 1}
        if len(key_counts) == len(good_count):
            good_batch = True

    # Update slices
    key_counts = {i:key.count(i) for i in names}
    slices_amb = [a_i - b_i for a_i, b_i in zip(slices_amb, key_counts.values())]
    slices_cmb = [a_i - b_i for a_i, b_i in zip(slices_cmb, key_counts.values())]

    # Remove sample from training set
    for t in testing_set:
        if t in ambiguous:
            ambiguous.remove(t)
        if t in combined:
            combined.remove(t)

    # Build models
    context, vocab = preprocess(sentences, slices, names)
    model = SVC(C=1., kernel='linear', decision_function_shape='ovo')
    model.fit(context, names)

    context_amb, vocab_amb = preprocess(ambiguous, slices_amb, names)
    model_amb = SVC(C=1., kernel='linear', decision_function_shape='ovo')
    model_amb.fit(context_amb, names)

    context_cmb, vocab_cmb = preprocess(combined, slices_cmb, names)
    model_cmb = SVC(C=1., kernel='linear', decision_function_shape='ovo')
    model_cmb.fit(context_cmb, names)

    # Prepare tests for prediction
    testing_set = cleaner(testing_set)
    # Bag of words for all test sentences
    df = bow(testing_set, vocab=vocab)
    df_ambig = bow(testing_set, vocab=vocab_amb)
    df_comb = bow(testing_set, vocab=vocab_cmb)

    for i in range(len(testing_set)):
        guess.append(model.predict(df.loc[i].to_frame().T))
        guess_amb.append(model_amb.predict(df_ambig.loc[i].to_frame().T))
        guess_cmb.append(model_cmb.predict(df_comb.loc[i].to_frame().T))
    results = pd.DataFrame([guess_amb, guess, guess_cmb,key],
                           index=['Guess (ambiguous)', 'Guess (unambiguous)',
                                  'Guess (combined)', 'Correct Answer'])
    for i in results.index[:3]:
        grades.append(np.where(results.loc['Correct Answer'] == results.loc[i], True, False))

    if len(sys.argv) > 2:
        if sys.argv[2] == '--v':
            print("-------------------------")
            print(f'Test set: (n = {len(testing_set)})')
            for key, value in key_counts.items():
                print(f'{key}: {value}')
            print(f'\nAmbiguous n: {len(ambiguous)} \t slices: {slices_amb}')
            print(f'Unambiguous n: {len(sentences)} \t slices: {slices}')
            print(f'Combined n: {len(combined)} \t slices: {slices_cmb}')
            if len(names) > 2:
                for i in results.index[:3]:
                    print(f'\n{i} MCM & F1:')
                    print(multilabel_confusion_matrix(list(results.loc['Correct Answer']),
                                                      list(results.loc[i]), labels=names))
                    f = dict(zip(names, f1_score(list(results.loc['Correct Answer']),
                                                 list(results.loc[i]), average=None, labels=names)))
                    for key, value in f.items():
                        print(f'{key}: {value}')
            else:
                for i in results.index[:3]:
                    print(f'\n{i} CM & F1:')
                    print(confusion_matrix(list(results.loc['Correct Answer']), list(results.loc[i])))
                    f = dict(zip(names, f1_score(list(results.loc['Correct Answer']),
                                                 list(results.loc[i]), average=None, labels=names)))
                    for key, value in f.items():
                        print(f'{key}: {value}')
            print(f'\nAccuracy (ambiguous): {sum(grades[0])/len(testing_set)}')
            print(f'Accuracy (unambiguous): {sum(grades[1])/len(testing_set)}')
            print(f'Accuracy (combined): {sum(grades[2])/len(testing_set)}')
    else:
        print(acronym, sum(grades[0])/len(testing_set), sum(grades[1])/len(testing_set), sum(grades[2])/len(testing_set))

References

Turtel, Benjamin D., and Dennis Shasha. 2007. “Acronym Disambiguation.” Courant Institute, NYU. https://cs.nyu.edu/media/publications/TR2015-973.pdf.