Diego Cabello


ENTS (Extendable Nested Tagging Schema)

Date: 2025 Apr 11

Words: 1244


One of the best pieces of advice I’ve ever gotten was “follow whatever you are interested in”.

About a year ago I started using Obsidian, but I was bothered by its limited tagging system and its lack of nested tags. There were add-ons for that, but nothing native, and I came to find I preferred more minimalistic, CLI/UNIX-inspired solutions anyway. Since then I have become more familiar with the ways information can be stored in bits and text file formats, and I have read some of Claude Shannon’s papers on information theory. Yet file tagging stayed with me like a thorn in my side. It felt like a largely unexplored area with particularly strong implications for researchers who have to comb through a lot of local files. Most systems treat tags as flat labels, which limits hierarchical organization and nuanced queries. So I made the Draft Extendable Nested Tagging System, or DENTS.

DENTS is a system and program built around its own .dents file extension, with features borrowed from ext4 file tagging and SQL commands. Its main innovation is the nested tagging system defined by the .dents format. When DENTS runs, it reads the nested categories from a text file and uses them to filter files. By default it looks for the .dents file in the current working directory; a flag can point it at a .dents file in another directory. .dents files are written in YAML. This nested tagging is especially useful for people who research a few specialized fields and want to find connections between disparate fields in their paper collections.

Tags Format in YAML

Example:

Organic Chemistry:
  - Medicinal Chemistry
  - Polymer Chemistry:
      - Plastic Materials
      - Carbon Nanomaterials:
          - Carbon Nanotubes
          - Graphene
  - Petrochemistry
Inorganic Chemistry:
  - Coordination Chemistry
  - Organometallic Chemistry
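
To make the nesting concrete, here is a minimal Python sketch of how such a file might be flattened into a child-to-parent map. It assumes the YAML has already been parsed by a YAML loader into nested dicts and lists. The `parsed` literal and the function name `flatten_tags` are illustrative only and are not part of DENTS.

```python
from typing import Optional

# What a YAML loader would typically produce for the example above:
# mappings become dicts, sequences become lists, leaf tags stay strings.
parsed = {
    "Organic Chemistry": [
        "Medicinal Chemistry",
        {"Polymer Chemistry": [
            "Plastic Materials",
            {"Carbon Nanomaterials": ["Carbon Nanotubes", "Graphene"]},
        ]},
        "Petrochemistry",
    ],
    "Inorganic Chemistry": ["Coordination Chemistry", "Organometallic Chemistry"],
}


def flatten_tags(node, parent: Optional[str] = None, out=None) -> dict:
    """Return {tag: parent_tag}, with None as the parent of top-level tags."""
    if out is None:
        out = {}
    if isinstance(node, str):        # leaf tag
        out[node] = parent
    elif isinstance(node, list):     # siblings that share the same parent
        for item in node:
            flatten_tags(item, parent, out)
    elif isinstance(node, dict):     # tag(s) with children
        for tag, children in node.items():
            out[tag] = parent
            flatten_tags(children, tag, out)
    return out


parents = flatten_tags(parsed)
print(parents["Carbon Nanotubes"])   # Carbon Nanomaterials
print(parents["Organic Chemistry"])  # None
```

Storing only the parent of each tag is enough to rebuild the whole tree, which is why a flat map like this is a convenient intermediate form.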

Commands

Usage: dents <command> <operator> <arguments>

DENTS commands use an inverse function mapping: one to many, not many to one. The first argument is of one type and is followed by any number of arguments of a second type, for example one tag followed by many files.

process

Takes a YAML file and turns it into a .lent file

  • Usage: dents process file.yaml
  • no additional operators or arguments

tagtofiles

Usage: dents tagtofiles <add|remove|show> <arguments>*

Description: Manages relationships between one tag and many files

Alias: ttf

  • add
    • Usage: dents tagtofiles add <tag> <files>*
    • Description: creates links from one tag to one or more files
  • remove
    • Usage: dents tagtofiles remove <tag> <files>*
    • Description: removes links from one tag to one or more files
    • Alias: rm
  • show
    • Usage: dents tagtofiles show <tags>*
    • Description: shows all files linked to each given tag. Can take more than one tag as input
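
The one-tag-then-many-files shape of tagtofiles can be sketched with Python's standard argparse module. This is only an illustration of the command grammar described above, not the actual DENTS parser; `build_parser` is a hypothetical name, and the file names are made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dents")
    commands = parser.add_subparsers(dest="command", required=True)

    # "ttf" is the alias listed above for tagtofiles.
    ttf = commands.add_parser("tagtofiles", aliases=["ttf"])
    ops = ttf.add_subparsers(dest="operator", required=True)

    for name, aliases in (("add", []), ("remove", ["rm"])):
        op = ops.add_parser(name, aliases=aliases)
        op.add_argument("tag")               # exactly one tag ...
        op.add_argument("files", nargs="+")  # ... followed by one or more files

    show = ops.add_parser("show")
    show.add_argument("tags", nargs="+")     # show accepts several tags
    return parser


args = build_parser().parse_args(["ttf", "rm", "graphene", "a.pdf", "b.pdf"])
print(args.tag, args.files)  # graphene ['a.pdf', 'b.pdf']
```

Putting the single argument first means everything after it can be treated as a list, so shell globs such as *.pdf expand naturally into the files position.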

filetotags

Usage: dents filetotags <add|remove|show> <arguments>*

Description: Manages relationships between one file and many tags

Alias: ftt

  • add
    • Usage: dents filetotags add <file> <tags>*
    • Description: creates links from one file to one or more tags
  • remove
    • Usage: dents filetotags remove <file> <tags>*
    • Description: removes links from one file to one or more tags
    • Alias: rm
  • show
    • Usage: dents filetotags show <files>*
    • Description: shows all tags linked to each given file. Can take more than one file as input
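
Since tagtofiles and filetotags are two views of the same many-to-many relation, one way to picture the storage is a pair of dictionaries kept in sync. The sketch below is an assumption about how this could be modelled in memory; the class `TagIndex` and its method names are hypothetical and say nothing about how DENTS stores links on disk.

```python
from collections import defaultdict


class TagIndex:
    """Many-to-many links between tags and files, stored in both directions."""

    def __init__(self) -> None:
        self.files_by_tag = defaultdict(set)
        self.tags_by_file = defaultdict(set)

    def link(self, tag: str, file: str) -> None:
        self.files_by_tag[tag].add(file)
        self.tags_by_file[file].add(tag)

    def unlink(self, tag: str, file: str) -> None:
        self.files_by_tag[tag].discard(file)
        self.tags_by_file[file].discard(tag)

    def tagtofiles_add(self, tag: str, files: list) -> None:
        for file in files:               # one tag, many files
            self.link(tag, file)

    def filetotags_add(self, file: str, tags: list) -> None:
        for tag in tags:                 # one file, many tags
            self.link(tag, file)

    def tagtofiles_show(self, tag: str) -> list:
        return sorted(self.files_by_tag.get(tag, set()))

    def filetotags_show(self, file: str) -> list:
        return sorted(self.tags_by_file.get(file, set()))


index = TagIndex()
index.tagtofiles_add("graphene", ["a.pdf", "b.pdf"])
index.filetotags_add("a.pdf", ["petrochemistry"])
print(index.tagtofiles_show("graphene"))  # ['a.pdf', 'b.pdf']
print(index.filetotags_show("a.pdf"))     # ['graphene', 'petrochemistry']
```

Whichever command creates a link, both lookups see it, which is the behaviour the two command families imply.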

filter

Usage: dents filter [-flags] <tags>*

Description: Returns the files associated with the given tags. This is the main command of DENTS and the one where nesting comes into play, unlike tagtofiles, which ignores nesting.

  • Nested:
    • Usage: dents filter <tags>
    • Description: the default mode.
    • Example: dents filter "organic chemistry" will return everything that is tagged with something within organic chemistry. If a file is tagged only with “carbon nanotubes”, it will be returned because “carbon nanotubes” is a sub-tag in organic chemistry
  • Explicit -e
    • Usage: dents filter -e <tags>
    • Description: returns a file only if it is explicitly tagged with the query tag
    • Example: dents filter -e "organic chemistry" will return a file marked with “organic chemistry,carbon nanotubes” but not one with just “carbon nanotubes”.
    • Note: Uses the same implementation as dents tagtofiles show <tags>* and is included here as the -e flag for mnemonic purposes
  • Traverse Down by Count -td
    • Usage: dents filter -td <count> <tags>
    • Description: Returns files only if they are explicitly tagged with a tag at most <count> levels below the query tag.
    • Example: dents filter -td 2 "organic chemistry" will return everything explicitly tagged with any tag between 0 and 2 layers down from “organic chemistry”. In this example, files tagged with “carbon nanomaterials” will be returned but files tagged with only “carbon nanotubes” will not.
    • Note: Can be combined with Traverse Up by Count as -tu <count> -td <count>. For example, dents filter -tu 1 -td 2 <tags> will traverse one node up the tree, then return everything recursively down two layers from there.
  • Traverse Up by Count -tu
    • Usage: dents filter -tu <count> <tags>
    • Description: Moves <count> nodes up the tag tree from the query tag before filtering. Recursive by default.
    • Example: dents filter -tu 1 "organic chemistry" will traverse one node up the tag tree then return everything recursively down from that
    • Note: Can be combined with Traverse Down by Count as -td <count> -tu <count>
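
The four filter modes above can be sketched in a few lines of Python. This is a minimal model of the described behaviour, not the DENTS implementation. The `CHILDREN` tree is the organic chemistry branch of the YAML example, `FILE_TAGS` is made-up sample data, tag names are assumed to be normalised to lowercase, and stopping at a top-level tag when -tu runs out of parents is also an assumption.

```python
CHILDREN = {
    "organic chemistry": ["medicinal chemistry", "polymer chemistry", "petrochemistry"],
    "polymer chemistry": ["plastic materials", "carbon nanomaterials"],
    "carbon nanomaterials": ["carbon nanotubes", "graphene"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}

# Hypothetical files and the tags explicitly attached to them.
FILE_TAGS = {
    "a.pdf": {"organic chemistry", "carbon nanotubes"},
    "b.pdf": {"carbon nanotubes"},
    "c.pdf": {"carbon nanomaterials"},
    "d.pdf": {"medicinal chemistry"},
}


def tags_within(tag, depth=None):
    """Return `tag` plus its descendants, at most `depth` levels down (None = no limit)."""
    found, frontier, level = {tag}, [tag], 0
    while frontier and (depth is None or level < depth):
        frontier = [c for t in frontier for c in CHILDREN.get(t, [])]
        found.update(frontier)
        level += 1
    return found


def filter_files(tag, explicit=False, up=0, down=None):
    for _ in range(up):                # -tu: climb toward the root first
        tag = PARENT.get(tag, tag)     # assumption: stay put at a top-level tag
    wanted = {tag} if explicit else tags_within(tag, down)  # -e, or nested / -td
    return sorted(f for f, tags in FILE_TAGS.items() if tags & wanted)


print(filter_files("organic chemistry"))                 # ['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf']
print(filter_files("organic chemistry", explicit=True))  # ['a.pdf']
print(filter_files("organic chemistry", down=2))         # ['a.pdf', 'c.pdf', 'd.pdf']
print(filter_files("carbon nanotubes", up=1))            # ['a.pdf', 'b.pdf', 'c.pdf']
```

Note how b.pdf, tagged only with "carbon nanotubes", drops out of the -td 2 result because that tag sits three levels below "organic chemistry".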

The DENTS file format

  1. Terminology and stylistic choices
    • Some set theory will be used in combination with colloquial English because it is the best way to communicate some things clearly and unambiguously
    • Hex-line: refers to a standard 16-byte row in a hex dump
    • Metadata: the common term for “data about the data”, used here in its usual sense.
    • Ipsodata: contrasted with metadata; “data that is the data itself”. Common English terms for this, such as “core data”, “primary data”, “source data”, or “raw data”, do not capture what I mean here: the data itself, as opposed to the data about the data. So the Latin root “ipso”, meaning “itself”, is used as a prefix to “data”.
    • Offset: refers here only to the length of the offset, not to the content found at it.
  2. Over-arching design choices for DENTS
    • DENTS is made for nested tagging with tree-shaped tag hierarchies. Each tag has a parent (except for the top-level tags) and can have children.
    • DENTS is designed as a multi-user database that serves many simultaneous queries
    • The file format is designed for human-readability in hex dump
    • To my knowledge, no existing solution accomplishes all of these points
  3. DENTS is designed for quick lookup of child tags in a multi-user-session database. This means hash table lookup (O(1) on average) as opposed to tree traversal or linear traversal (O(N)).
  4. File Header (32 Bytes, or 2 hex-lines)
    1. DENTS magic number in Unicode
    2. Version number
    3. Tags quantity denoter
  5. Tags Metadata (4a)
    1. Tags Metadata = {Tag metadata individual length indicator, Tag meta/ipsodata individual offset indicator, Tag name}
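
The 32-byte header could be laid out in many ways; the field widths are not specified above, so the following Python sketch picks them purely for illustration. It assumes a 16-byte magic field (so that "DENTS" fills the first hex-line and stays readable in a hex dump), a 4-byte version, a 4-byte tag count, and 8 reserved bytes, all little-endian. The names `HEADER`, `pack_header`, and `unpack_header` are hypothetical.

```python
import struct

# Hypothetical layout, little-endian, 32 bytes in total (2 hex-lines):
#   16 bytes  magic "DENTS", NUL-padded to fill the first hex-line
#    4 bytes  version number (unsigned)
#    4 bytes  tags quantity (unsigned)
#    8 bytes  reserved padding
HEADER = struct.Struct("<16sII8x")
MAGIC = "DENTS".encode("utf-8")


def pack_header(version: int, tag_count: int) -> bytes:
    # struct pads the 16s field with NUL bytes when the value is shorter.
    return HEADER.pack(MAGIC, version, tag_count)


def unpack_header(raw: bytes):
    magic, version, tag_count = HEADER.unpack(raw[:HEADER.size])
    if magic.rstrip(b"\x00") != MAGIC:
        raise ValueError("not a DENTS file")
    return version, tag_count


header = pack_header(version=1, tag_count=11)
print(len(header))            # 32
print(unpack_header(header))  # (1, 11)
```

A fixed-size header like this lets a reader learn the tag count before touching the tags metadata that follows.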

LENTS: Planned Version


DENTS uses a custom binary implementation, which is difficult to extend or write queries for. So after DENTS I will make LENTS (Local Extendable Nested Tagging System), which will likely be a wrapper around Refsplitr and PartiQL/DynamoDB Local and should allow more customizable queries. LENTS will probably retain the same command schema as DENTS.

  • Refsplitr - “refsplitr is a package designed to assist researchers dealing with bibliometric data by providing tools for author name disambiguation, author georeferencing, and coauthorship network mapping using data from the Web of Science” [1]
    • Author disambiguation
    • Support for multiple authors per file
    • Author/Institution timeline mapping
    • Institution disambiguation
  • PartiQL - “An expressive, SQL-compatible query language giving access to relational, semi-structured, and nested data.” [2]
    • allows for more custom queries, including queries with the additional information above
  • Potential Implementations: Metadata Extraction Tools
    • ExifTool by Phil Harvey - “ExifTool is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files.”
    • Apache Tika - “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.”
  • Visualization tools

  1. Stevens, Forrest. “Refsplitr: Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data.” Journal of Open Source Software, n.d.

  2. https://partiql.org/index.html

