Analysing the entire Holy Bible in less than a second

Word frequency analysis is a staple of quantitative text analysis.

Nushell has a plugin, nu_plugin_polars, that adds Polars dataframe support.
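For readers who have not set the plugin up yet, here is a minimal sketch of one way to do it on a recent Nushell release. It assumes a Rust toolchain with cargo is available and that cargo installs binaries to its default location; the path is an assumption and may differ on your machine. Older Nushell versions registered plugins differently.

```nushell
# Build and install the plugin binary (assumes cargo is on the PATH).
cargo install nu_plugin_polars

# Add the binary to Nushell's plugin registry.
# The path is cargo's default install location, assumed here.
plugin add ~/.cargo/bin/nu_plugin_polars

# Load the plugin into the current session without restarting the shell.
plugin use polars
```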

Once the plugin is installed, you can analyse a large corpus of text very quickly:

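def bible [] {
    # Read the whole text, lowercase it, split it into words and load them into a Polars dataframe
    let corp = open --raw /home/lk/Data/king-james-bible.txt | str downcase | split words | polars into-df
    # Stop words to leave out of the count
    let stop = [i a an and are as at be by for from how in is it of on or that the this to was what when where who will with] | polars into-df
    # Boolean mask: true where a word is a stop word
    let mask = $corp | polars is-in $stop
    # Keep only the words that are not stop words
    let tidy = $corp | polars filter-with ($mask | polars not)
    # Count the occurrences of each distinct word
    let freq = $tidy | polars value-counts
    # Sort ascending by count, so the most frequent words come last
    let sort = $freq | polars sort-by count
    $sort
}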

This gives:

╭───────┬────────────┬───────╮
│     # │     0      │ count │
├───────┼────────────┼───────┤
│     0 │ endow      │     1 │
│     1 │ clappeth   │     1 │
│     2 │ elishaphat │     1 │
│     3 │ muse       │     1 │
│     4 │ makaz      │     1 │
│     5 │ swimmeth   │     1 │
│     6 │ fidelity   │     1 │
│     7 │ jeziah     │     1 │
│     8 │ savours    │     1 │
│     9 │ ashvath    │     1 │
│   ... │ ...        │ ...   │
│ 13019 │ all        │  5637 │
│ 13020 │ them       │  6430 │
│ 13021 │ not        │  6624 │
│ 13022 │ him        │  6659 │
│ 13023 │ they       │  7378 │
│ 13024 │ lord       │  7964 │
│ 13025 │ his        │  8473 │
│ 13026 │ unto       │  8997 │
│ 13027 │ shall      │  9840 │
│ 13028 │ he         │ 10422 │
╰───────┴────────────┴───────╯

And it took less than a second.

To be specific: 444ms 413µs 290ns.
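The post does not say how the timings were taken. One way to take such a measurement, sketched here as an assumption rather than the method actually used, is Nushell's built-in `timeit` command, which runs a block and returns the elapsed time as a duration. The sketch assumes the `bible` and `biblenu` commands from this post are already defined in the session; results will vary with hardware and with whether the file is already in the disk cache.

```nushell
# Time the Polars version (assumes `bible` is defined as in this post).
timeit { bible }

# Time the plain Nushell version (assumes `biblenu` is defined as in this post).
timeit { biblenu }
```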

Doing the same in “vanilla” Nushell like so:

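def biblenu [] {
    # Same pipeline, but the words go into a one-column table named "corp"
    let corp = open --raw /home/lk/Data/king-james-bible.txt | str downcase | split words | wrap corp
    # Stop words to leave out of the count
    let stop = [i a an and are as at be by for from how in is it of on or that the this to was what when where who will with] | wrap stop
    # Drop the rows whose word appears in the stop list
    let tidy = $corp | where corp in $stop.stop == false
    # Count the occurrences of each distinct row
    let freq = $tidy | uniq --count
    # Sort ascending by count, so the most frequent words come last
    let sort = $freq | sort-by count
    # Flatten the nested record so the word gets its own "corp" column
    $sort | flatten
}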

gives (only the tail of the output is shown):

│ 13019 │ all   │  5637 │
│ 13020 │ them  │  6430 │
│ 13021 │ not   │  6624 │
│ 13022 │ him   │  6659 │
│ 13023 │ they  │  7378 │
│ 13024 │ lord  │  7964 │
│ 13025 │ his   │  8473 │
│ 13026 │ unto  │  8997 │
│ 13027 │ shall │  9840 │
│ 13028 │ he    │ 10422 │
├───────┼───────┼───────┤
│     # │ corp  │ count │
╰───────┴───────┴───────╯

So the same result.

But this time it took 4sec 118ms 703µs 223ns.

So for this kind of task, Nushell with Polars is roughly 9x faster than plain Nushell.
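For reference, the speed-up follows directly from the two timings reported above, converted to seconds:

```latex
\[
\text{speed-up}
  = \frac{t_{\text{vanilla}}}{t_{\text{polars}}}
  = \frac{4.118703223\ \text{s}}{0.444413290\ \text{s}}
  \approx 9.27
\]
```

This is a single run of each version, so the ratio is a rough indication rather than a benchmark.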