Inspecting YAML frontmatter in Markdown files with Nushell

Markdown is widely used when working with text-based generative artificial intelligence.

YAML is suitable for more structured data.

The two can be combined as follows.

---
title: Blog about Nushell and structured data
date: 2024-09-19
tags:
  - Nushell
  - data
---
 
The body text goes here.

Imagine the above text is in a file called blog-1.md.

The part between the --- lines is the frontmatter in YAML. Everything below the second --- line is the markdown content.

Then in Nushell I create a small custom command like so:

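# Split a Markdown document into its YAML frontmatter and its body.
# Input: the file contents as a string. Output: a record with the keys frontmatter and content.
# Note: every '---' line is treated as a separator, so a '---' rule inside the body is dropped.
def frontmatter [] {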
  lines |
  split list '---' |
  do { |lst|
    let fm = $lst.0 |
      to text | from yaml |
      into value | into record
    {
      frontmatter: $fm
      content: ($lst | skip 1 | to text | str trim)
    }
  } $in
}

With this I can do:

open blog-1.md | frontmatter

And I get:

╭─────────────┬────────────────────────────────────────────────────╮
│             │ ╭───────┬────────────────────────────────────────╮ │
│ frontmatter │ │ title │ Blog about Nushell and structured data │ │
│             │ │ date  │ 11 hours ago                           │ │
│             │ │       │ ╭───┬─────────╮                        │ │
│             │ │ tags  │ │ 0 │ Nushell │                        │ │
│             │ │       │ │ 1 │ data    │                        │ │
│             │ │       │ ╰───┴─────────╯                        │ │
│             │ ╰───────┴────────────────────────────────────────╯ │
│ content     │ The body text goes here.                           │
╰─────────────┴────────────────────────────────────────────────────╯

This can be used in other pipelines, e.g. when publishing:

let body = open $path | frontmatter | get content
let meta = open $path | frontmatter | get frontmatter
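
Both lines above parse the same file, so the parsed record could also be bound once and reused. A minimal sketch, where a hand-written record (made-up values) stands in for the output of `open $path | frontmatter`, and the variable names `post` and `title` are only illustrative:

```nushell
# Stand-in for the result of `open $path | frontmatter`; the values are invented for illustration.
let post = {
  frontmatter: {title: "Demo post", tags: [Nushell data]}
  content: "The body text goes here."
}

# Cell paths pull out each part without parsing the file a second time.
let body = $post.content
let meta = $post.frontmatter
let title = $post.frontmatter.title

print $title
print $body
```

Whether this is worth it depends on the pipeline; for a one-off, the two separate calls are just as readable.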

But also for finding, for example, posts that are not yet published:

open *.md | each { frontmatter } | get frontmatter | where draft == true

and slightly more complicated things like finding the most frequently used tools:

glob **/*.md | par-each { open $in | frontmatter | get frontmatter.tools? } | flatten | uniq -c | sort-by count

giving

╭────┬───────────┬───────╮
│  # │   value   │ count │
├────┼───────────┼───────┤
│  0 │ latex     │     1 │
│  1 │ fontforge │     1 │
│  2 │ groff     │     1 │
│  3 │ LLM       │     1 │
│  4 │ Nushell   │     1 │
│  5 │ markdown  │     1 │
│  6 │ Marvin    │     1 │
│  7 │ drupal    │     1 │
│  8 │ pandoc    │     2 │
│  9 │ llm       │     2 │
│ 10 │ nvim      │     3 │
│ 11 │ aichat    │     3 │
│ 12 │ ChatGPT   │     4 │
│ 13 │ Nushell   │    13 │
╰────┴───────────┴───────╯

Note the ? after .tools in the above command. It is there because not all my markdown files have a tools key in the frontmatter. With the ?, files without a tools key are simply ignored instead of breaking the pipeline.
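
To see the effect of the ? in isolation, here is a small sketch on made-up records instead of real files. The `posts` list and its values are assumptions for illustration only. It uses `each` rather than `par-each`, and it also applies the same optional access to a `draft` filter, written with a closure so that a missing key simply compares as not equal to true:

```nushell
# Made-up frontmatter records; only some of them have a tools or draft key.
let posts = [
  {title: "first", tools: [nvim pandoc], draft: true}
  {title: "second"}
  {title: "third", tools: [nvim]}
]

# `get tools?` returns null for "second", and `each` drops null results,
# so only the records that have a tools key contribute to the count.
let tool_counts = $posts | each { get tools? } | flatten | uniq --count | sort-by count

# Same idea in a filter: `$post.draft?` is null when the key is missing,
# and null is not equal to true, so such records are filtered out.
let drafts = $posts | where {|post| $post.draft? == true } | get title

print $tool_counts
print $drafts
```

Without the ?, I would expect both pipelines to stop with an error at the "second" record, since the key lookup fails there.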

Much more can be done with this, of course, not least in combination with LLMs.

Final note:

Doing

glob **/*.md | par-each { open $in | frontmatter }

is more robust than

open **/*.md | par-each { frontmatter }

when some files have lots of weird characters in them.