Analysing the entire Holy Bible in less than a second
Word frequency analysis is important in quantitative text analysis.
Nushell has a plugin for Polars support.
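For readers who have not set the plugin up yet, here is a minimal sketch of one way to install and register it. It assumes a recent Nushell (0.93 or later, where `plugin add` and `plugin use` are available) and a Rust toolchain with `cargo`. The binary path is an assumption and depends on where cargo installs binaries on your system, and the plugin version generally needs to match your Nushell version.

```nu
# Build and install the Polars plugin (assumes cargo is on the PATH)
cargo install nu_plugin_polars --locked
# Register the plugin with Nushell; this path is an assumption, adjust as needed
plugin add ~/.cargo/bin/nu_plugin_polars
# Load the plugin into the current session
plugin use polars
```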
With the plugin installed, you can analyse a large corpus of text very quickly:
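def bible [] {
    # Read the text, lowercase it, split it into words and load them into a dataframe
    let corp = open --raw /home/lk/Data/king-james-bible.txt | str downcase | split words | polars into-df
    # Stop words to exclude from the count
    let stop = [i a an and are as at be by for from how in is it of on or that the this to was what when where who will with] | polars into-df
    # Boolean mask: true where a word is a stop word
    let mask = $corp | polars is-in $stop
    # Keep only the words that are not stop words
    let tidy = $corp | polars filter-with ($mask | polars not)
    # Count occurrences of each word and sort ascending by count
    let freq = $tidy | polars value-counts
    let sort = $freq | polars sort-by count
    $sort
}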
Giving:
╭───────┬────────────┬───────╮
│ # │ 0 │ count │
├───────┼────────────┼───────┤
│ 0 │ endow │ 1 │
│ 1 │ clappeth │ 1 │
│ 2 │ elishaphat │ 1 │
│ 3 │ muse │ 1 │
│ 4 │ makaz │ 1 │
│ 5 │ swimmeth │ 1 │
│ 6 │ fidelity │ 1 │
│ 7 │ jeziah │ 1 │
│ 8 │ savours │ 1 │
│ 9 │ ashvath │ 1 │
│ ... │ ... │ ... │
│ 13019 │ all │ 5637 │
│ 13020 │ them │ 6430 │
│ 13021 │ not │ 6624 │
│ 13022 │ him │ 6659 │
│ 13023 │ they │ 7378 │
│ 13024 │ lord │ 7964 │
│ 13025 │ his │ 8473 │
│ 13026 │ unto │ 8997 │
│ 13027 │ shall │ 9840 │
│ 13028 │ he │ 10422 │
╰───────┴────────────┴───────╯
And it took less than a second.
To be specific: 444ms 413µs 290ns.
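The post does not say how the timing was taken. One plausible way, offered here as an assumption rather than the author's method, is Nushell's built-in `timeit` command, which runs a block and returns the elapsed time as a duration:

```nu
# Assumes the `bible` command defined above is already in scope
timeit { bible }
```

A single run can be noisy, so repeating the measurement a few times gives a fairer picture.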
Doing the same in “vanilla” Nushell like so:
def biblenu [] {
    # Same pipeline using only built-in Nushell commands
    let corp = open --raw /home/lk/Data/king-james-bible.txt | str downcase | split words | wrap corp
    let stop = [i a an and are as at be by for from how in is it of on or that the this to was what when where who will with] | wrap stop
    # Drop stop words, count the remaining words, sort ascending by count
    let tidy = $corp | where corp in $stop.stop == false
    let freq = $tidy | uniq --count
    let sort = $freq | sort-by count
    $sort | flatten
}
gives (only the tail of the output is shown, ending with the table footer):
│ 13019 │ all │ 5637 │
│ 13020 │ them │ 6430 │
│ 13021 │ not │ 6624 │
│ 13022 │ him │ 6659 │
│ 13023 │ they │ 7378 │
│ 13024 │ lord │ 7964 │
│ 13025 │ his │ 8473 │
│ 13026 │ unto │ 8997 │
│ 13027 │ shall │ 9840 │
│ 13028 │ he │ 10422 │
├───────┼───────┼───────┤
│ # │ corp │ count │
╰───────┴───────┴───────╯
So the result is the same.
But this time it took 4sec 118ms 703µs 223ns.
So Nushell with Polars is roughly 9x faster for this kind of task than Nushell without Polars (about 4.12 s versus 0.44 s).
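The speedup can be checked directly from the two measured durations. This is a small sketch; the variable names are made up for illustration, and it relies on Nushell returning a plain number when one duration is divided by another:

```nu
# The two timings reported above, written as Nushell duration literals
let with_polars = 444ms + 413us + 290ns
let vanilla = 4sec + 118ms + 703us + 223ns
# Ratio of the two runs, rounded to one decimal place
$vanilla / $with_polars | math round --precision 1
```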