Development

How were the word frequency levels determined?

Over 90 million words were counted using a crawler that processed all ~250,000 articles published on the BBC sites (including science-related channels) during 2012-2015. The articles were crawled using the Scrapy framework (http://scrapy.org/). The crawl included only article text and ignored advertisements, reader comments, phone numbers, websites, and emails. All words were extracted to an Excel sheet, creating a dictionary of every distinct word found and its number of appearances in the corpus.
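The counting step can be illustrated with a minimal sketch. This is not the crawler's actual code: the regular expressions, the `count_words` function, and the choice to lowercase all text are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical patterns for the material the crawl ignored
# (assumed for this sketch, not the authors' actual rules).
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
# A "word" here is a run of letters, optionally with one apostrophe part.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(articles):
    """Count appearances of each distinct word across article texts."""
    counts = Counter()
    for text in articles:
        text = text.lower()
        # Blank out emails, websites, and phone numbers before counting.
        for pattern in (EMAIL_RE, URL_RE, PHONE_RE):
            text = pattern.sub(" ", text)
        counts.update(WORD_RE.findall(text))
    return counts


articles = [
    "The value of pressure. Values change; contact news@example.com "
    "or visit http://example.com today.",
    "Pressure, pressure! Call 020 7946 0000.",
]
counts = count_words(articles)
```

In this sketch `value` and `values` are counted separately, and the email address, website, and phone number contribute no words to the dictionary.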

Overall, ~500,000 word types were ordered by number of appearances. A word type is each distinct word form: for example, value and values are separate word types, even though they belong to the same word family.

Frequently used words may appear thousands of times, while jargon may appear only a few times: e.g., season, pressure, and current appeared over 10,000 times, pollution 1,608 times, gene 389 times, and specifications 90 times. These counts show how frequent the words are on everyday news sites that are open to and designed for the public.
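Ordering word types by number of appearances can be sketched as follows, using three of the counts quoted above. The function name `rank_word_types` and the alphabetical tie-breaking rule are assumptions for this illustration, not a description of the original spreadsheet.

```python
from collections import Counter


def rank_word_types(counts):
    """Return (rank, word, appearances) tuples, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, n) for rank, (word, n) in enumerate(ordered, start=1)]


# Counts taken from the examples given in the text.
sample_counts = Counter({"gene": 389, "pollution": 1608, "specifications": 90})
ranking = rank_word_types(sample_counts)
```

Here `ranking` lists pollution first, then gene, then specifications, mirroring their relative frequency in the corpus.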

In the literature, the upper boundary of high-frequency words has been placed at the 2,000-word level (Nation, 2006; Schmitt & Schmitt, 2014), and low-frequency words start at the 9,000-10,000 level. Using this as a guideline, we created high-frequency, mid-frequency, and jargon levels for our corpus word list, which set the cutoffs for our jargon identifier (for a detailed version and validation, see Rakedzon et al., 2016).
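A rough sketch of how rank-based levels could be assigned is shown below. The cutoffs of 2,000 and 9,000 are simply the literature guideline values, used here as assumptions; the cutoffs actually used by the jargon identifier are described in Rakedzon et al. (2016). The function name, the example ranks, and the treatment of unlisted words as jargon are also hypothetical.

```python
# Assumed rank cutoffs based on the literature guideline; the tool's
# real cutoffs are documented in Rakedzon et al. (2016).
HIGH_FREQUENCY_MAX_RANK = 2_000
MID_FREQUENCY_MAX_RANK = 9_000


def frequency_level(word, rank_of):
    """Classify a word by its frequency rank in the corpus word list.

    rank_of maps each word type to its rank (1 = most frequent).
    Words missing from the list are treated as jargon (an assumption).
    """
    rank = rank_of.get(word.lower())
    if rank is None:
        return "jargon"
    if rank <= HIGH_FREQUENCY_MAX_RANK:
        return "high frequency"
    if rank <= MID_FREQUENCY_MAX_RANK:
        return "mid-frequency"
    return "jargon"


# Invented ranks, for illustration only.
example_ranks = {"pressure": 850, "pollution": 4200, "specifications": 15000}
levels = {word: frequency_level(word, example_ranks) for word in example_ranks}
```

Working with ranks rather than raw counts keeps the cutoffs comparable to the word-level boundaries reported in the literature.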

References:
Nation, I. (2006). How Large a Vocabulary is Needed For Reading and Listening? Canadian Modern Language Review, 63(1), 59–82.
Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503.
Rakedzon, T., Segev, E., & Baram-Tsabari, A. (2016). An automatic jargon identifier for scientists engaging with the public and for science communication educators. Manuscript in Preparation.

Which literature did you use for evaluating the results?

Baram-Tsabari, A., & Lewenstein, B. V. (2013). An Instrument for Assessing Scientists’ Written Skills in Public Communication of Science. Science Communication, 35(1), 56–85. (link)
Hu, M., & Nation, I. S. P. (2000). Vocabulary density and reading comprehension. Reading in a Foreign Language, 23, 403–430. (link)
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language, 22(1). (link)
Nation, I. (2006). How Large a Vocabulary is Needed For Reading and Listening? Canadian Modern Language Review, 63(1), 59–82. (link)
Nation, I. S. (2001). Learning Vocabulary in Another Language. New York: Cambridge University Press. (link)
Rakedzon, T., & Baram-Tsabari, A. (2016). Assessing and improving L2 graduate students’ popular science and academic writing in an academic writing course. Educational Psychology. (link)
Rakedzon, T., Segev, E., & Baram-Tsabari, A. (2016). An automatic jargon identifier for scientists engaging with the public and for science communication educators. Manuscript in Preparation.
Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503. (link)
Sharon, A. J., & Baram-Tsabari, A. (2013). Measuring mumbo jumbo: A preliminary quantification of the use of jargon in science communication. Public Understanding of Science (Bristol, England), 23(5), 528–546. (link)