De-Jargonizer

How were the word frequency levels determined?

Over 90 million words were counted by a crawler that processed all ~250,000 articles published on the BBC sites (including the science-related channels) during the years 2012-2015. The articles were crawled using the Scrapy framework (http://scrapy.org/). The crawl included only article text and ignored advertisements, reader comments, phone numbers, websites, and emails. All words were extracted to an Excel sheet, creating a dictionary of every distinct word found and its number of appearances in the corpus.

Overall, ~500,000 word types were ordered by number of appearances. A word type is a distinct word form: for example, value and values are separate word types, even though they belong to the same word family.
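
The counting and ordering steps described above can be illustrated with a minimal Python sketch. It is not the project's crawler: the regular expressions, the function name count_word_types, and the two sample texts are assumptions made for illustration, and the real filtering rules may differ.

```python
import re
from collections import Counter

# Hypothetical patterns for content the crawl is described as ignoring.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
# A word is a run of letters, optionally with one internal apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_word_types(articles):
    """Count appearances of each word type (distinct word form)."""
    counts = Counter()
    for text in articles:
        text = text.lower()
        # Blank out websites, emails, and phone numbers before tokenizing.
        for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
            text = pattern.sub(" ", text)
        counts.update(WORD_RE.findall(text))
    return counts


# Made-up sample texts standing in for crawled articles.
articles = [
    "The value of the gene is clear. Contact us at news@example.com.",
    "Gene values differ; see http://example.com or call 020 7946 0000.",
]
counts = count_word_types(articles)

# Order word types by number of appearances, most frequent first.
ranked = counts.most_common()
```

Here value and values are counted separately, matching the definition of word types, while the email address, website, and phone number contribute nothing to the counts.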

Frequently used words may appear thousands of times, while jargon may appear only a few times: e.g., season, pressure, and current each appeared over 10,000 times, pollution 1,608 times, gene 389 times, and specifications 90 times. These counts reflect how often the words occur on everyday news sites that are open to and designed for the public.

General vocabulary has traditionally been divided into high frequency (1,000–3,000 word families) and low frequency (above the 9,000 word family level). More recently, the literature has also presented a mid-frequency group (the 3,000–9,000 word family level) drawn from general vocabulary (Nation, 2006; Schmitt & Schmitt, 2014). Using this as a guideline, we defined high-frequency, mid-frequency, and jargon levels for our corpus word list, which set the cutoffs for our jargon identifier (for a detailed description and validation, see Rakedzon et al., 2017). Validation was conducted by statistically comparing results to existing tools and lists in the field (Cobb, 2016; Nation, 2012).
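
As a rough illustration of how cutoffs on the number of appearances could sort words into the three levels, consider the Python sketch below. The cutoff values HIGH_MIN and MID_MIN are placeholders chosen only for this example, not the cutoffs used by the De-Jargonizer; the actual values are reported in Rakedzon et al. (2017). The function name frequency_level is also hypothetical.

```python
# Placeholder cutoffs (assumptions for illustration only; the real
# cutoffs are given in Rakedzon et al., 2017).
HIGH_MIN = 1000  # at least this many appearances: high frequency
MID_MIN = 50     # at least this many, but below HIGH_MIN: mid-frequency


def frequency_level(appearances, high_min=HIGH_MIN, mid_min=MID_MIN):
    """Map a word's corpus count to one of the three levels."""
    if appearances >= high_min:
        return "high frequency"
    if appearances >= mid_min:
        return "mid-frequency"
    return "jargon"


# Counts quoted in the text above; 10_001 stands in for "over 10,000".
corpus_counts = {
    "season": 10_001,
    "pollution": 1_608,
    "gene": 389,
    "specifications": 90,
}
levels = {word: frequency_level(n) for word, n in corpus_counts.items()}
```

Because the cutoffs are placeholders, the labels this sketch assigns to the example words should not be read as the tool's actual classification.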

References:
Cobb T. Compleat Lexical Tutor v.8 [computer program] [Internet]. 2016 [cited 2016 Aug 14]. (link)
Nation, I. (2006). How Large a Vocabulary is Needed For Reading and Listening? Canadian Modern Language Review, 63(1), 59–82.
Nation ISP. The BNC/COCA word family lists 25,000 [Internet]. v. 2012 [cited 2016 Aug 13]. (link)
Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503.
Rakedzon, T., Segev, E., Chapnik, N., Yosef, R., & Baram-Tsabari, A. (2017). Automatic jargon identifier for scientists engaging with the public and science communication educators. PLoS ONE, 12(8), e0181742.

Which literature did you use for evaluating the results?

Baram-Tsabari, A., & Lewenstein, B. V. (2013). An Instrument for Assessing Scientists’ Written Skills in Public Communication of Science. Science Communication, 35(1), 56–85. (link)
Hu, M., & Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403–430. (link)
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language, 22(1), 15–30. (link)
Nation, I. (2006). How Large a Vocabulary is Needed For Reading and Listening? Canadian Modern Language Review, 63(1), 59–82. (link)
Nation, I. S. (2001). Learning Vocabulary in Another Language. New York: Cambridge University Press. (link)
Rakedzon, T., & Baram-Tsabari, A. (2016). Assessing and improving L2 graduate students’ popular science and academic writing in an academic writing course. Educational Psychology. (link)
Rakedzon, T., Segev, E., & Baram-Tsabari, A. (2016). An automatic jargon identifier for scientists engaging with the public and for science communication educators. Manuscript in Preparation.
Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503. (link)
Sharon, A. J., & Baram-Tsabari, A. (2013). Measuring mumbo jumbo: A preliminary quantification of the use of jargon in science communication. Public Understanding of Science (Bristol, England), 23(5), 528–546. (link)

Who developed it?

The program was developed by:
• Associate Professor Ayelet Baram-Tsabari, head of the Science Communication research group at the Faculty of Education in Science and Technology, Technion – Israel Institute of Technology
• Tzipora Rakedzon, lecturer and coordinator of academic writing for graduate students at the Department of Humanities and Arts at the Technion – Israel Institute of Technology
• Dr. Elad Segev, lecturer at the Department of Applied Mathematics at the Holon Institute of Technology
• Noam Chapnik and Roy Yosef, computer science students at the Holon Institute of Technology, Israel
• Izabell Gershanik and Daniel Sokol, department of science students at the Holon Institute of Technology, Israel

Source code can be found at: https://github.com/NoamAndRoy/JargonProject.
