Chapter 2 NASA SVM [ML|PY]
Acronym disambiguation with machine learning
A personal project using Python and a public NASA dataset of acronyms appearing in NASA white-paper abstracts.
My motivation comes from this suggestion in the dataset's description:
This was found to be a suitable dataset for training disambiguation models that use the context of the surrounding sentences to predict the correct meaning of the acronym. The prototype machine-learning models created from this dataset have not been released.
So I decided to make my own prototype model and was inspired by an acronym disambiguation white paper from Courant Institute, NYU (Turtel and Shasha 2007).
2.1 Description
Goal: Use an SVM and statistical analysis (of words/context) to properly classify ambiguous acronym definitions based on provided data.
- Acronym disambiguation is the process of determining the correct expansion/definition of an acronym in a given context
- These specific acronyms have multiple definitions, making them ambiguous when undefined
- Ambiguous sentences contain undefined acronyms, whereas unambiguous sentences contain both the acronyms and their expansions
Example:
In the given NASA abstracts, the acronym IMF is used with two separate definitions:
- Interplanetary Magnetic Field (32 instances)
- Intrinsic Mode Functions (14 instances)
Given an ambiguous sentence containing an ambiguous acronym:
Expressed in the IMFs, they have well-behaved Hilbert Transforms from which instantaneous frequencies can be calculated.
The algorithm guesses what IMF stands for (in this case, Intrinsic Mode Functions) based on three different contexts: ambiguous, unambiguous, and total (the whole abstract).
2.1.1 Data
Each line in processed_acronyms.jsonl is an acronym found to have more than one definition; there are 484 of them.
However, as the dataset's description admits, this isn't perfectly clean data: what counts as a "proper" alternate definition/expansion is loose (see the Future Work section), and a few entries give wonky results (see below).
For example, "TOMS" isn't really an ambiguous acronym:
- Total Ozone Mapping Spectrometer
- Ozone Mapping Spectrometer
This algorithm works for any legitimate ambiguous acronym within the dataset.
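For a quick look at the data, something like the sketch below works; it assumes only the acronym and definition fields that the full script at the end of this chapter also reads, and that each JSONL line carries a single acronym/definition pair:
import json
from collections import defaultdict

# Group definitions by acronym to see which acronyms are ambiguous
definitions = defaultdict(list)
with open('nasa-svm_data/processed/processed_acronyms.jsonl') as f:
    for line in f:
        entry = json.loads(line)
        definitions[entry['acronym']].append(entry['definition'])

print(len(definitions))       # number of distinct acronyms in the file
print(definitions['IMF'])     # e.g. the two IMF expansions shown above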
Example usage:
#!/usr/bin/env bash
python3 nasa_svm.py <acronym> [--v]
I could have tailored the algorithm to sift through, and then run on, every legitimate acronym, but for brevity's sake I selected 10 acronyms with 2 definitions, 5 acronyms with 3 definitions, and 1 acronym with 4 definitions (roughly paralleling the overall distribution):
Acronyms
1. "US": 'United States', 'Upper Stage'
2. "SST": 'Sea Surface Temperature', 'Shear Stress Transport'
3. "IMF": 'Interplanetary Magnetic Field', 'Intrinsic Mode Functions'
4. "VMS": 'Vertical Motion Simulator', 'Visual Motion Simulator'
5. "RMS": 'Remote Manipulator System', 'Root Mean Square'
6. "DOE": 'Department of Energy', 'Design of Experiments'
7. "NAS": 'National Airspace System', 'Numerical Aerodynamic Simulation'
8. "LET": 'Linear Energy Transfer', 'Link Evaluation Terminal'
9. "MLS": 'Microwave Limb Sounder', 'Microwave Landing System'
10. "RCS": 'Reaction Control System', 'Radar Cross Section'
11. "ISO": 'Infrared Space Observatory', 'International Standards Organization', 'Imaging Spectrometric Observatory'
12. "CM": 'Crew Module', 'Command Module', 'Configuration Management'
13. "PEM": 'Pressurized Excursion Module', 'Proton Exchange Membrane', 'Pacific Exploratory Mission'
14. "CRM": 'Common Research Model', 'Cockpit Resource Management', 'Crew Resource Management'
15. "LCC": 'Launch Control Center', 'Launch Commit Criteria', 'Life Cycle Cost'
16. "ATM": 'Air Traffic Management', 'Asynchronous Transfer Mode', 'Apollo Telescope Mount', 'Airborne Topographic Mapper'
2.2 Algorithm
Skeleton:
INPUT: Acronym from NASA abstract
Filter all relevant abstracts
FOR each acronym definition:
Find and extract all sentences containing acronym
Separate into ambiguous and unambiguous
Remove both the definitions and acronyms
END
Randomly sample test sentences containing ambiguous acronym
Extract and remove these from training set
Build feature vector of surrounding meaningful words for context
Concat sentences for each definition as the n documents for tf-idf
Train multi-class linear SVCs to yield word frequency coefficients
Three models: ambiguous, unambiguous, and total contexts
FOR each test sentence:
Create feature vector and feed into model for prediction
Grade predictions
END
OUTPUT: results in csv format for analysis
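Condensed, the training step of this skeleton looks roughly like the sketch below (it mirrors what the full script at the end of this chapter does; sentences_per_definition and definition_names are illustrative placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# One concatenated "document" per definition: these are the n documents for tf-idf
sentences_per_definition = [
    ['The field is carried outward by the solar wind.',
     'Magnetic flux variations follow the solar cycle.'],
    ['Each mode admits a well-behaved Hilbert transform.'],
]
definition_names = ['Interplanetary Magnetic Field', 'Intrinsic Mode Functions']
docs = [' '.join(sents) for sents in sentences_per_definition]

tfidf = TfidfVectorizer(stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b')
X_train = tfidf.fit_transform(docs)    # one tf-idf row per definition
model = SVC(C=1., kernel='linear', decision_function_shape='ovo')
model.fit(X_train, definition_names)   # the class label is the definition itself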
CSV Output:
The acronym along with the accuracies for the ambiguous, unambiguous, and total (combined) contexts.
python3 nasa_svm.py DOE
## DOE 0.9491525423728814 0.9661016949152542 0.9661016949152542
Accuracy is used as the scoring metric; for single-label multi-class predictions like these it is equivalent to the micro-averaged aggregate F-score, which accounts for the class imbalance in the data (see Analysis).
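As a quick sanity check of that equivalence (micro-averaged F-score reduces to plain accuracy whenever each sample has exactly one true label), using made-up labels:
from sklearn.metrics import accuracy_score, f1_score

# Made-up single-label multi-class predictions
y_true = ['def_a', 'def_a', 'def_b', 'def_a', 'def_c', 'def_b']
y_pred = ['def_a', 'def_b', 'def_b', 'def_a', 'def_c', 'def_a']

print(accuracy_score(y_true, y_pred))             # 0.666...
print(f1_score(y_true, y_pred, average='micro'))  # identical to the accuracy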
Verbose Output:
Provides more insight such as the number of training examples for each definition, the confusion matrices, and F1-scores for each model (to show performance in each class).
time python3 nasa_svm.py ISO --v
## -------------------------
## Test set: (n = 30)
## Infrared Space Observatory: 23
## International Standards Organization: 2
## Imaging Spectrometric Observatory: 5
##
## Ambiguous n: 120 slices: [93, 15, 13]
## Unambiguous n: 105 slices: [80, 14, 12]
## Combined n: 712 slices: [544, 99, 70]
##
## Guess (ambiguous) MCM & F1:
## [[[ 5 2]
## [ 3 20]]
##
## [[28 0]
## [ 1 1]]
##
## [[21 4]
## [ 2 3]]]
## Infrared Space Observatory: 0.888888888888889
## International Standards Organization: 0.6666666666666666
## Imaging Spectrometric Observatory: 0.5
##
## Guess (unambiguous) MCM & F1:
## [[[ 6 1]
## [ 4 19]]
##
## [[27 1]
## [ 0 2]]
##
## [[21 4]
## [ 2 3]]]
## Infrared Space Observatory: 0.8837209302325583
## International Standards Organization: 0.8
## Imaging Spectrometric Observatory: 0.5
##
## Guess (combined) MCM & F1:
## [[[ 2 5]
## [ 1 22]]
##
## [[28 0]
## [ 1 1]]
##
## [[24 1]
## [ 4 1]]]
## Infrared Space Observatory: 0.8800000000000001
## International Standards Organization: 0.6666666666666666
## Imaging Spectrometric Observatory: 0.28571428571428575
##
## Accuracy (ambiguous): 0.8
## Accuracy (unambiguous): 0.8
## Accuracy (combined): 0.8
##
## real 0m4.267s
## user 0m3.655s
## sys 0m1.080s
A bag-of-words model is built for the test samples, one per context vocabulary, so that the feature vectors fed to each classifier line up with the vocabulary it was trained on.
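Concretely, this mirrors the bow() helper in the source at the end of the chapter: test sentences are counted only against the vocabulary learned from a given training context, so the resulting feature vectors match the columns the corresponding SVC was trained on (vocab and test_sentences below are placeholders):
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder vocabulary (in the script this comes from the tf-idf step)
vocab = {'solar': 0, 'wind': 1, 'hilbert': 2, 'transform': 3}
test_sentences = ['Expressed in the modes, they have well behaved Hilbert transform values.']

bow_vectorizer = CountVectorizer(stop_words='english',
                                 token_pattern=r'(?u)\b[A-Za-z]+\b',
                                 vocabulary=vocab)
X_test = bow_vectorizer.fit_transform(test_sentences)
print(X_test.toarray())   # counts over the shared training vocabulary only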
The SVM is implemented with scikit-learn as a support vector classifier (SVC) with a linear kernel. Model parameters (e.g., C) were taken from the aforementioned white paper, so fine-tuning via cross-validation and the like is not included here; it is proposed in the TODO section as a future improvement.
The SVC uses a one-vs-one (OVO) shape for the decision function (see under the hood), which in the binary case always considers the distance from the hyperplane.
- I.e., how "deep" the data point sits inside a specific class's region
This also helps explain why the algorithm works so well for 2 definitions but poorly for 3-4 (see Conclusions).
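To see what that looks like in practice, the decision values of a fitted linear SVC can be inspected directly. A minimal sketch, where model and X_test are placeholders for a fitted classifier and test vectors built over its training vocabulary (as in the snippets above):
# With decision_function_shape='ovo' there is one value per pair of classes,
# i.e. n_classes * (n_classes - 1) / 2 values; for 2 definitions this is a
# single signed value per test sentence.
scores = model.decision_function(X_test)
print(scores)

# The predicted definition is the side of the hyperplane the sentence falls on;
# values near zero mean the sentence is not "deep" in either class's region.
print(model.predict(X_test))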
2.3 Analysis
Given the slices shown in the verbose output, almost every acronym has a dominant definition that appears 2-5 times more often than the others.
Since the test sentences are randomly sampled, I made sure every definition appears in each test set.
This class imbalance, along with the sample sizes for the training sets being pretty small, causes trouble when the model tries classifying acronyms with more than 2 definitions.
I also added a classifier trained with all the sentences in the abstracts for each acronym as a combined context to see if stripping “noise” helped. My guess is that since the training sets are small, every bit of text matters.
- Collecting only the surrounding words in sentences containing acronyms (as opposed to using the whole abstract for example) should work best with a larger training set
2.3.1 Scripts
For example, running the script 5 times per acronym and appending the results to CSV files:
#!/bin/zsh
for acronym in US SST IMF VMS RMS DOE NAS LET MLS RCS ISO CM PEM CRM LCC ATM
do
    printf 'Writing to ./nasa-svm_data/results/output_%s.csv\n' "$acronym"
    for i in {1..5}
    do
        python3 nasa_svm.py "$acronym" >> nasa-svm_data/results/output_${acronym}.csv
    done
done
Quick analysis using Python:
import csv
import pandas as pd
from glob import glob
file_list = glob("nasa-svm_data/results/*.csv")

for file in file_list:
    with open(file, newline='') as csvfile:
        buf = pd.DataFrame(csv.reader(csvfile, delimiter=' ', quotechar='|'),
                           columns=['acronym', 'acc_amb', 'acc_unamb', 'acc_comb'])
    if file == file_list[0]:
        grades = buf
    else:
        grades = pd.concat([grades, buf], axis=0)

grades = grades.astype({'acc_amb': float, 'acc_unamb': float, 'acc_comb': float})

# Take grouped averages and sort by ambiguous
avgs = grades.groupby(['acronym'])[['acc_amb', 'acc_unamb', 'acc_comb']].mean()
avgs = avgs.sort_values(['acc_amb'], ascending=False)
# Average accuracies (33 randomly sampled test groups)
print(avgs)
## acc_amb acc_unamb acc_comb
## acronym
## IMF 0.990676 0.904429 0.990676
## MLS 0.982290 0.974813 0.979142
## RCS 0.976874 0.974482 0.990431
## US 0.964912 0.974482 0.998405
## LET 0.964349 0.941176 0.998217
## DOE 0.963020 0.954802 0.955316
## SST 0.939394 0.909091 0.944056
## NAS 0.926815 0.913665 0.923957
## RMS 0.921212 0.943434 0.929293
## LCC 0.746212 0.645833 0.700758
## ISO 0.744444 0.762626 0.795960
## VMS 0.741414 0.888889 0.751515
## PEM 0.609626 0.597148 0.661319
## CM 0.608586 0.624579 0.693603
## CRM 0.582888 0.545455 0.549020
## ATM 0.449811 0.462121 0.482008
# Average accuracies for 2 definitions
print(avgs.head(10)[['acc_amb', 'acc_unamb', 'acc_comb']].mean())
## acc_amb 0.937576
## acc_unamb 0.913621
## acc_comb 0.941025
## dtype: float64
# Average accuracies for 3 definitions (and one 4)
print(avgs.tail(5)[['acc_amb', 'acc_unamb', 'acc_comb']].mean())
## acc_amb 0.598465
## acc_unamb 0.623638
## acc_comb 0.627493
## dtype: float64
2.4 Conclusions
Training the model on entire abstracts provided a marginal increase in accuracy over immediate contexts despite greatly increasing the vocabulary, which hints that most of it is noise. This also makes sense intuitively, given that NASA white paper abstracts are similar in structure.
For 2 definitions this algorithm performs quite well given sparse vocabularies formed from small data.
For more than 2 definitions we run into some problems, the main ones being small training sets and imbalanced class distributions.
There aren’t enough instances to properly train the models, but this is out of our hands. Simply put, the models need more examples of immediate context to capture the greater complexity presented in acronyms with 3-4 definitions – which in turn would provide better hyperplanes to separate the definitions.
Also, with an OVO decision function, the multi-class problem is treated as several related binary classification tasks:
It is harder to distinguish the remaining 2-3 "lesser-known" definitions from the "main/common" definition with the highest frequency; the common definition "encroaches", which tends toward more false negatives for the common class and more false negatives/positives between the lesser-known ones.
2.4.1 TODO
Results could be improved by implementing an unbiased classifier, perhaps with a one-vs-all/one-vs-rest (OVA/OVR) decision function (as opposed to the OVO used earlier).
By default, the model assigns equal class weights because it assumes the data is evenly distributed between classes.
For example, with an acronym that has 3 definitions:
- The class labels follow a skewed, roughly fixed ratio (the common definition dominates), though the exact distribution differs from acronym to acronym
- Using the class_weight='balanced' hyperparameter, the SVC decreases the weight of records in the "common" class in order to balance the total weight of each class (e.g., 2:1:1 gives [0.75, 1.5, 1.5])
  - Would have to compare with manually calculated class weights based on each distribution (e.g., 2:1:1 gives [1, 2, 2])
- Perform comparative metrics for OVO vs OVR given acronyms with 3-4 definitions (see the sketch below)
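A sketch of what that comparison might look like; the 2:1:1 label counts are illustrative, compute_class_weight shows what 'balanced' would actually assign, and LinearSVC stands in for the OVR variant since it trains one-vs-rest by default:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.svm import SVC, LinearSVC

# Illustrative 2:1:1 label distribution over three definitions
y = np.array(['common'] * 2 + ['rare_a'] + ['rare_b'])

# 'balanced' assigns n_samples / (n_classes * count_per_class) to each class
print(compute_class_weight('balanced', classes=np.unique(y), y=y))

# Balanced OVO SVC (as in the current script) ...
svc_balanced = SVC(C=1., kernel='linear', decision_function_shape='ovo',
                   class_weight='balanced')
# ... versus manually set inverse-frequency weights ...
svc_manual = SVC(C=1., kernel='linear', decision_function_shape='ovo',
                 class_weight={'common': 1, 'rare_a': 2, 'rare_b': 2})
# ... versus a one-vs-rest linear classifier; each would then be fit and
# scored on the same train/test split for the comparison.
svc_ovr = LinearSVC(C=1., class_weight='balanced')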
2.5 Future Work?
There are a few hundred other acronyms available to play with. The following acronyms could serve as inspiration for another (more complex) project, perhaps for:
- Learning grammar for either correction or more nuanced guesses (see CFD, TRMM, MSS, TES)
- Developing some similarity-based merging of classifications (see AVIRIS, NASA, JPL, CCD)
- Maybe even a combo of the aforementioned (see GEO)
List of Acronyms - Nuanced Definitions:
Including but not limited to:
1. "CFD": 'Computational Fluid Dynamics', 'Computational fluid dynamics'
2. "TRMM": 'Tropical Rainfall Measuring Mission', 'Tropical Rainfall Measurement Mission', 'Tropical Rain Measuring Mission'
3. "MSS": 'Mobile Satellite Service', 'Mobile Servicing System'
4. "AVIRIS": 'Airborne Visible/Infrared Imaging Spectrometer', 'Airborne Visible and Infrared Imaging Spectrometer'
5. "TES": 'Thermal Emission Spectrometer', 'Tropospheric Emission Spectrometer'
6. "NASA": 'National Aeronautics & Space Administration', 'National Aeronautic and Space Administration'
7. "JPL": 'Jet Propulsion Laboratory', 'Jet Propulsion Lab'
8. "CCD": 'Charge Coupled Device', 'Charge Coupled Devices'
9. "GEO": 'Geosynchronous Earth Orbit', 'geosynchronous Earth orbit', 'Group on Earth Observations', 'Geostationary Earth Orbit', 'geostationary Earth orbit'
Useful Metadata:
Each acronym definition also has "NASA terms" attached to it; for example, IMF comes with the following "additional context", which could be utilized:
1. Interplanetary Magnetic Field: 'SOLAR MAGNETIC FIELD', 'SOLAR WIND', 'INTERPLANETARY MAGNETIC FIELDS', 'MAGNETIC FLUX', 'MAGNETIC PROBES', 'SPACE PROBES'
2. Intrinsic Mode Functions: 'HILBERT TRANSFORMATION', 'SPECTRAL EMISSION', 'DECOMPOSITION', 'SPECTRUM ANALYSIS', 'TIME FUNCTIONS', 'FREQUENCY DISTRIBUTION', 'NONLINEARITY', 'TIME SERIES ANALYSIS', 'NONLINEAR SYSTEMS', 'NONLINEAR EQUATIONS'
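One simple way to use them, not something the current script does (though the commented-out terms line in the source below hints at the same idea), is to append each definition's terms to its training document before vectorizing. A minimal sketch with the terms listed above:
# Fold the NASA-terms metadata into each definition's training document so that
# words like 'solar', 'wind', or 'hilbert' also count toward that context's vocabulary.
nasa_terms = {
    'Interplanetary Magnetic Field': ['SOLAR MAGNETIC FIELD', 'SOLAR WIND'],
    'Intrinsic Mode Functions': ['HILBERT TRANSFORMATION', 'DECOMPOSITION'],
}
# Placeholder training documents (one concatenated string per definition)
training_docs = {
    'Interplanetary Magnetic Field': 'the field is carried outward by the solar wind',
    'Intrinsic Mode Functions': 'each mode admits a well behaved hilbert transform',
}
for definition, terms in nasa_terms.items():
    training_docs[definition] += ' ' + ' '.join(terms).lower()

print(training_docs['Intrinsic Mode Functions'])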
Source Code
The full source code for this project can be downloaded or viewed below:
Download nasa_svm.py
import json
import re
import nltk
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, f1_score
from sklearn.svm import SVC
from functools import reduce
import random
import operator
import sys
#nltk.download('punkt')
#nltk.download('stopwords')
def preprocess(sentences, slices, names = None, tfidf = True, context = None):
    if context is None: context = []
    for i, s in enumerate(slices):
        if i == 0:
            context.append(' '.join(sentences[:s]))
        else:
            context.append(' '.join(sentences[slices[i-1]:(s + slices[i-1])]))
    context = cleaner(context)
    if tfidf:
        if names is None:
            vectors, vocab = tf_idf(context)
        else:
            vectors, vocab = tf_idf(context, names)
        return vectors, vocab
    else:
        return context

def cleaner(context):
    context = [re.sub(r'\w*\d\w*', '', w) for w in context]
    context = [re.sub(r'[^A-Za-z0-9 ]+', '', w) for w in context]
    context = [re.sub(r'\s+', ' ', w) for w in context]
    return context

def filtered(sentence, keywords):
    for k in keywords:
        sentence = sentence.replace(k, '')
    return sentence

def tf_idf(context, names = None):
    vectorizer = TfidfVectorizer(stop_words='english',
                                 token_pattern=r'(?u)\b[A-Za-z]+\b')
    vector = vectorizer.fit_transform(context)
    vocab = vectorizer.vocabulary_
    tokens = vectorizer.get_feature_names_out()
    df_tfidf = pd.DataFrame(data=vector.toarray(), index=names, columns=tokens)
    return df_tfidf, vocab

def bow(context, names = None, vocab = None):
    vectorizer = CountVectorizer(stop_words='english',
                                 token_pattern=r'(?u)\b[A-Za-z]+\b',
                                 vocabulary=vocab)
    vector = vectorizer.fit_transform(context)
    tokens = vectorizer.get_feature_names_out()
    df_bow = pd.DataFrame(data=vector.toarray(), index=names, columns=tokens)
    return df_bow
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
acronym_list = ['US','SST','IMF','VMS','RMS','DOE','NAS','LET',
                'MLS','RCS','ISO','CM','PEM','CRM','LCC','ATM']

with open('nasa-svm_data/processed/processed_acronyms.jsonl', 'r') as json_acronyms, open('nasa-svm_data/raw/results_merged.jsonl', 'r') as json_corpus:
    if len(sys.argv) > 1:
        if sys.argv[1] in acronym_list:
            acronym = sys.argv[1]
        else:
            print("Example Usage: python3 nasa_svm.py <acronym> [--v]")
            print(f'Acronyms: {acronym_list}')
            exit()
    else:
        acronym = random.choice(acronym_list)

    json_list = list(json_acronyms)
    json_list_corpus = list(json_corpus)
    ambiguous, sentences, combined, names, terms = [], [], [], [], []
    slices, slices_amb, slices_cmb, slices_amb_ind = [], [], [], []
    guess, guess_amb, guess_cmb, grades = [], [], [], []
    for json_str in json_list:
        result = json.loads(json_str)
        if result['acronym'] == acronym:
            keywords = [result['definition'] + " (" + result['acronym'] + ")",
                        result['acronym'] + " (" + result['definition'] + ")",
                        result['acronym'], result['definition']]
            names.append(result['definition'])
            # For all the abstracts where acronym variant is found
            for index in result['corpus_positions']:
                json_abstract = json_list_corpus[index]
                # Take abstract, split each into sentences
                abstract = json.loads(json_abstract)['description']
                parsed = tokenizer.tokenize(str(abstract))
                # Filter sentences and feed into split bucket
                for sentence in parsed:
                    # All of the sentences with keywords removed
                    combined.append(sentence)
                    combined[-1] = filtered(combined[-1], keywords)
                    if sentence.find(keywords[0]) != -1 or sentence.find(keywords[1]) != -1:
                        sentences.append(sentence)
                        # Remove acronyms and their definitions
                        sentences[-1] = filtered(sentences[-1], keywords)
                        # Change 'if' below back to elif to have every sentence containing
                        # both definition/acronym pair AND undefined acronyms
                        # considered as only part of [sentences] (NOT also ambiguous)
                        # also change sentences[-1] below back to sentence
                        if sentence.find(keywords[2]) != -1:
                            ambiguous.append(sentences[-1])
                            ambiguous[-1] = filtered(ambiguous[-1], keywords)
                            # Ambiguous slices w.r.t. definitions
                            slices_amb_ind.append(index)
                    elif sentence.find(keywords[2]) != -1:
                        ambiguous.append(sentence)
                        ambiguous[-1] = filtered(ambiguous[-1], keywords)
                        # Ambiguous slices w.r.t. definitions
                        slices_amb_ind.append(index)
            # Slices based on which acronym sentence belongs to
            if len(slices_amb) == 0:
                slices_amb.append(len(slices_amb_ind))
                slices.append(len(sentences))
                slices_cmb.append(len(combined))
            elif len(slices_amb) == 1:
                slices_amb.append(len(slices_amb_ind) - slices_amb[0] + 1)
                slices.append(len(sentences) - slices[0] + 1)
                slices_cmb.append(len(combined) - slices_cmb[0] + 1)
            else:
                # More than 2 definitions
                slices_amb.append(len(slices_amb_ind) - reduce(operator.add, slices_amb) + 1)
                slices.append(len(sentences) - reduce(operator.add, slices) + 1)
                slices_cmb.append(len(combined) - reduce(operator.add, slices_cmb) + 1)
            # Optional additional contextual terms for each definition
            #terms.append(' '.join(map(str, cleaner(json.loads(json_abstract)['subject.NASATerms']))))
    # Ensure at least 2 samples from each definition are extracted
    good_batch = False
    while not good_batch:
        testing_set = random.sample(ambiguous, round(len(ambiguous)/5))
        testing_ind, key = [], []
        # Determine which acronym the random sample belongs to
        for t in testing_set:
            if any(t in testing_set for t in ambiguous):
                testing_ind.append(ambiguous.index(t))
            for count, i in enumerate(slices_amb):
                start = len(ambiguous) - sum(slices_amb[count:], -1)
                if ambiguous.index(t) in range(start, i+start):
                    key.append(names[count])
        key_counts = {i:key.count(i) for i in names}
        good_count = {k:v for (k,v) in key_counts.items() if v > 1}
        if len(key_counts) == len(good_count):
            good_batch = True

    # Update slices
    key_counts = {i:key.count(i) for i in names}
    slices_amb = [a_i - b_i for a_i, b_i in zip(slices_amb, key_counts.values())]
    slices_cmb = [a_i - b_i for a_i, b_i in zip(slices_cmb, key_counts.values())]

    # Remove sample from training set
    for t in testing_set:
        if any(t in testing_set for t in ambiguous):
            ambiguous.remove(t)
        if any(t in testing_set for t in combined):
            combined.remove(t)
    # Build models
    context, vocab = preprocess(sentences, slices, names)
    model = SVC(C=1., kernel='linear', decision_function_shape='ovo')
    model.fit(context, names)

    context_amb, vocab_amb = preprocess(ambiguous, slices_amb, names)
    model_amb = SVC(C=1., kernel='linear', decision_function_shape='ovo')
    model_amb.fit(context_amb, names)

    context_cmb, vocab_cmb = preprocess(combined, slices_cmb, names)
    model_cmb = SVC(C=1., kernel='linear', decision_function_shape='ovo')
    model_cmb.fit(context_cmb, names)

    # Prepare tests for prediction
    testing_set = cleaner(testing_set)
    # Bag of words for all test sentences
    df = bow(testing_set, vocab=vocab)
    df_ambig = bow(testing_set, vocab=vocab_amb)
    df_comb = bow(testing_set, vocab=vocab_cmb)

    for i in range(len(testing_set)):
        guess.append(model.predict(df.loc[i].to_frame().T))
        guess_amb.append(model_amb.predict(df_ambig.loc[i].to_frame().T))
        guess_cmb.append(model_cmb.predict(df_comb.loc[i].to_frame().T))
    results = pd.DataFrame([guess_amb, guess, guess_cmb, key],
                           index=['Guess (ambiguous)', 'Guess (unambiguous)',
                                  'Guess (combined)', 'Correct Answer'])
    for i in results.index[:3]:
        grades.append(np.where(results.loc['Correct Answer'] == results.loc[i], True, False))
    if len(sys.argv) > 2:
        if sys.argv[2] == '--v':
            print("-------------------------")
            print(f'Test set: (n = {len(testing_set)})')
            for key, value in key_counts.items():
                print(f'{key}: {value}')
            print(f'\nAmbiguous n: {len(ambiguous)} \t slices: {slices_amb}')
            print(f'Unambiguous n: {len(sentences)} \t slices: {slices}')
            print(f'Combined n: {len(combined)} \t slices: {slices_cmb}')
            if len(names) > 2:
                for i in results.index[:3]:
                    print(f'\n{i} MCM & F1:')
                    print(multilabel_confusion_matrix(list(results.loc['Correct Answer']),
                                                      list(results.loc[i]), labels=names))
                    f = dict(zip(names, f1_score(list(results.loc['Correct Answer']),
                                                 list(results.loc[i]), average=None, labels=names)))
                    for key, value in f.items():
                        print(f'{key}: {value}')
            else:
                for i in results.index[:3]:
                    print(f'\n{i} CM & F1:')
                    print(confusion_matrix(list(results.loc['Correct Answer']), list(results.loc[i])))
                    f = dict(zip(names, f1_score(list(results.loc['Correct Answer']),
                                                 list(results.loc[i]), average=None, labels=names)))
                    for key, value in f.items():
                        print(f'{key}: {value}')
            print(f'\nAccuracy (ambiguous): {sum(grades[0])/len(testing_set)}')
            print(f'Accuracy (unambiguous): {sum(grades[1])/len(testing_set)}')
            print(f'Accuracy (combined): {sum(grades[2])/len(testing_set)}')
    else:
        print(acronym, sum(grades[0])/len(testing_set), sum(grades[1])/len(testing_set), sum(grades[2])/len(testing_set))