Diego Cabello

ENTS (Extendable Nested Tagging Schema)


One of the best pieces of advice I’ve ever gotten was “follow whatever you are interested in”.

About a year ago I started using Obsidian, but I was bothered by its limited tagging system and the lack of nested tags. There were add-ons for this, but it wasn’t native, and I came to find I preferred more minimalistic CLI/UNIX-inspired solutions anyway. This has implications for researchers who have to comb through a lot of local files trying to find connections between disparate fields. Common file systems (ext4, NTFS, exFAT, APFS) treat tags as flat labels, which limits hierarchical tag organization and nuanced queries. So I made the Extendable Nested Tagging Schema, or ENTS.

ENTS is an implementable YAML and command schema. When implemented, the “S” in “ENTS” stands for “System”. Currently it comes in two implementations, MENTS (Metal ENTS) and LENTS (Local ENTS).

ENTS Schema


Tags Format in YAML

Example:

Organic Chemistry:
  - Medicinal Chemistry
  - Polymer Chemistry:
      - Plastic Materials
      - Carbon Nanomaterials:
          - Carbon Nanotubes
          - Graphene
  - Petrochemistry
Inorganic Chemistry:
  - Coordination Chemistry
  - Organometallic Chemistry
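
A minimal sketch of how this nested format could be flattened into lookup tables. The example hierarchy is written as the equivalent Python literal that a YAML parser would return (an assumption, to avoid a third-party dependency). The names `flatten`, `parent`, and `children` are hypothetical and not part of ENTS.

```python
# The example hierarchy as the nested dicts/lists a YAML parser would return.
TAGS = {
    "Organic Chemistry": [
        "Medicinal Chemistry",
        {"Polymer Chemistry": [
            "Plastic Materials",
            {"Carbon Nanomaterials": ["Carbon Nanotubes", "Graphene"]},
        ]},
        "Petrochemistry",
    ],
    "Inorganic Chemistry": [
        "Coordination Chemistry",
        "Organometallic Chemistry",
    ],
}

def flatten(node, parent_name=None, parent=None, children=None):
    """Walk the nested structure, recording each tag's parent and children."""
    parent = {} if parent is None else parent
    children = {} if children is None else children

    def record(name):
        parent[name] = parent_name          # None marks a top-level tag
        children.setdefault(name, [])
        if parent_name is not None:
            children[parent_name].append(name)

    if isinstance(node, str):               # a leaf tag
        record(node)
    elif isinstance(node, dict):            # a tag with sub-tags
        for name, kids in node.items():
            record(name)
            for kid in kids or []:
                flatten(kid, name, parent, children)
    return parent, children

parent, children = flatten(TAGS)
print(parent["Carbon Nanotubes"])       # the tag one level up
print(children["Polymer Chemistry"])    # direct sub-tags only
```

Keeping both a child-to-parent map and a parent-to-children map makes moving up or down the hierarchy a dictionary lookup rather than a search.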

Commands

A generic xents command will be used here as an example implementation.

Usage: xents <command> <operator> <arguments>

ENTS commands use an inverse function mapping: one to many, not many to one. This is chosen so that the first argument, of one type, can be followed by many arguments that all share a second type (different from the first).
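
As a rough illustration of the one-to-many shape, here is a sketch of how an implementation might parse `xents <command> <operator> <arguments>` with Python's standard `argparse` module. Only the `tagtofiles add` path is wired up, and `build_parser` is a hypothetical name. This is not how MENTS or LENTS actually parse their arguments.

```python
import argparse

def build_parser():
    """Build a parser for: xents tagtofiles add <tag> <files>*"""
    parser = argparse.ArgumentParser(prog="xents")
    commands = parser.add_subparsers(dest="command", required=True)

    # The command, with its short alias.
    ttf = commands.add_parser("tagtofiles", aliases=["ttf"])
    operators = ttf.add_subparsers(dest="operator", required=True)

    # The operator: one tag first, then one or more files.
    add = operators.add_parser("add")
    add.add_argument("tag")
    add.add_argument("files", nargs="+")
    return parser

args = build_parser().parse_args(
    ["tagtofiles", "add", "graphene", "a.pdf", "b.pdf"]
)
print(args.tag, args.files)
```

The single leading argument (`tag`) and the trailing `nargs="+"` list (`files`) mirror the one-to-many mapping described above.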

process

Takes a YAML file and turns it into a .lent file.

  • Usage: xents process file.yaml
  • no additional operators or arguments

tagtofiles

Usage: xents tagtofiles <add|remove|show> <arguments>*

Description: Manages relationships between one tag and many files

Alias: ttf

  • add
    • Usage: xents tagtofiles add <tag> <files>*
    • Description: creates links from one tag to ≥ 1 files
  • remove
    • Usage: xents tagtofiles remove <tag> <files>*
    • Description: removes links from one tag to ≥ 1 files
    • Alias: rm
  • show
    • Usage: xents tagtofiles show <tags>*
    • Description: shows all files linked to each given tag. Accepts more than one tag as input

filetotags

Usage: xents filetotags <add|remove|show> <arguments>*

Description: Manages relationships between one file and many tags

Alias: ftt

  • add
    • Usage: xents filetotags add <file> <tags>*
    • Description: creates links from one file to ≥ 1 tags
  • remove
    • Usage: xents filetotags remove <file> <tags>*
    • Description: removes links from one file to ≥ 1 tags
    • Alias: rm
  • show
    • Usage: xents filetotags show <files>*
    • Description: shows all tags linked to each given file. Accepts more than one file as input
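
The tagtofiles and filetotags commands can be read as two views of the same set of tag-file links. Below is a minimal in-memory sketch of that idea. The `TagIndex` class and its method names are hypothetical, and no storage format is implied.

```python
from collections import defaultdict

class TagIndex:
    """Keeps tag -> files and file -> tags in sync (illustrative only)."""

    def __init__(self):
        self.files_by_tag = defaultdict(set)
        self.tags_by_file = defaultdict(set)

    def _link(self, tag, file):
        self.files_by_tag[tag].add(file)
        self.tags_by_file[file].add(tag)

    def _unlink(self, tag, file):
        self.files_by_tag[tag].discard(file)
        self.tags_by_file[file].discard(tag)

    # tagtofiles: one tag, many files
    def tagtofiles_add(self, tag, *files):
        for file in files:
            self._link(tag, file)

    def tagtofiles_remove(self, tag, *files):
        for file in files:
            self._unlink(tag, file)

    def tagtofiles_show(self, *tags):
        return {tag: sorted(self.files_by_tag.get(tag, ())) for tag in tags}

    # filetotags: one file, many tags
    def filetotags_add(self, file, *tags):
        for tag in tags:
            self._link(tag, file)

    def filetotags_remove(self, file, *tags):
        for tag in tags:
            self._unlink(tag, file)

    def filetotags_show(self, *files):
        return {file: sorted(self.tags_by_file.get(file, ())) for file in files}

index = TagIndex()
index.tagtofiles_add("graphene", "a.pdf", "b.pdf")
index.filetotags_add("a.pdf", "carbon nanotubes", "petrochemistry")
index.filetotags_remove("a.pdf", "petrochemistry")
print(index.tagtofiles_show("graphene"))
print(index.filetotags_show("a.pdf"))
```

Both command families write to the same pair of maps, so a link created through one is visible through the other.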

filter

Usage: xents filter [-flags] <tags>*

Description: Returns files associated with tags. This is the main command of ENTS and the one where nesting comes into play, unlike the tagtofiles command, which ignores nesting.

  • Nested:
    • Usage: xents filter <tags>
    • Description: the default mode.
    • Example: xents filter "organic chemistry" will return everything that is tagged with something within organic chemistry. If a file is tagged only with “carbon nanotubes”, it will be returned because “carbon nanotubes” is a sub-tag in organic chemistry
  • Explicit -e
    • Usage: xents filter -e <tags>
    • Description: returns a file only if it is explicitly tagged with the query tag
    • Example: xents filter -e "organic chemistry" will return a file marked with “organic chemistry,carbon nanotubes” but not one with just “carbon nanotubes”.
    • Note: Uses the same implementation as xents tagtofiles show <tags>* and is included here as the -e flag for mnemonic purposes
  • Traverse Down by Count -td
    • Usage: xents filter -td <count> <tags>
    • Description: Returns files only if they are explicitly tagged with a tag within a certain number of levels below the query tag.
    • Example: xents filter -td 2 "organic chemistry" will return everything explicitly tagged with any tag between 0 and 2 layers down from “organic chemistry”. In this example, files tagged with “carbon nanomaterials” will be returned but files tagged with only “carbon nanotubes” will not.
    • Note: Can be combined with Traverse Up by Count as -tu <count> -td <count>. For example, xents filter -tu 1 -td 2 <tags> will traverse one node up the tree, then return everything recursively down two layers from that.
  • Traverse Up by Count -tu
    • Usage: xents filter -tu <count> <tags>
    • Description: Traverses the given number of levels up the tag tree before filtering. Recursive by default.
    • Example: xents filter -tu 1 "organic chemistry" will traverse one node up the tag tree, then return everything recursively down from that
    • Note: Can be combined with Traverse Down by Count as -tu <count> -td <count>
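
The four filter modes can be sketched with a child-to-parent map and a depth check. This is an illustration of the semantics described above, not the MENTS or LENTS implementation. `PARENT`, `FILE_TAGS`, `depth_below`, and `filter_files` are hypothetical names, the sample files are made up, and the choice to stop climbing at a top-level tag for -tu is an assumption.

```python
# Child -> parent map for part of the example hierarchy (None = top level).
PARENT = {
    "organic chemistry": None,
    "polymer chemistry": "organic chemistry",
    "carbon nanomaterials": "polymer chemistry",
    "carbon nanotubes": "carbon nanomaterials",
    "graphene": "carbon nanomaterials",
}

# Made-up files and their explicit tags.
FILE_TAGS = {
    "a.pdf": {"organic chemistry", "carbon nanotubes"},
    "b.pdf": {"carbon nanotubes"},
    "c.pdf": {"carbon nanomaterials"},
}

def depth_below(tag, ancestor):
    """Levels between tag and ancestor, or None if tag is outside its subtree."""
    depth = 0
    while tag is not None:
        if tag == ancestor:
            return depth
        tag = PARENT.get(tag)
        depth += 1
    return None

def filter_files(query, explicit=False, down=None, up=0):
    """explicit ~ -e, down ~ -td <count>, up ~ -tu <count>."""
    for _ in range(up):
        # Assumption: climbing stops once a top-level tag is reached.
        if PARENT.get(query) is not None:
            query = PARENT[query]
    max_depth = 0 if explicit else down     # None means unlimited depth
    hits = set()
    for file, tags in FILE_TAGS.items():
        for tag in tags:
            depth = depth_below(tag, query)
            if depth is not None and (max_depth is None or depth <= max_depth):
                hits.add(file)
                break
    return sorted(hits)

print(filter_files("organic chemistry"))                 # nested (default)
print(filter_files("organic chemistry", explicit=True))  # -e
print(filter_files("organic chemistry", down=2))         # -td 2
print(filter_files("graphene", up=1, down=0))            # -tu 1 -td 0
```

With this sample data, the -td 2 query returns the file tagged “carbon nanomaterials” (two levels down) but not the one tagged only “carbon nanotubes” (three levels down), matching the example in the list above.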

MENTS (Metal ENTS)


MENTS is a close-to-the-metal implementation of ENTS written in C with a custom binary file format. It is the most hackerish version of ENTS, has the smallest database size, and was made as a low-level learning project. It is not very extendable and the number of tags it can hold is limited, so LENTS is also an option.

The MENTS file format

  1. Terminology and stylistic choices
    • Some set theory will be used in combination with colloquial English because it is the best way to communicate some things clearly and unambiguously
    • Hex-line: refers to a standard 16-byte row in a hex dump
    • Metadata: used here in its common sense of “data about the data”.
    • Ipsodata: contrasted with metadata; “data that is the data itself”. Common English terms such as “core data”, “primary data”, “source data”, and “raw data” do not capture what I mean here: the data itself, as opposed to the data about the data. So the Latin root “ipso”, meaning “itself”, is used as a prefix to “data”.
    • Offset: refers here only to the length of the offset, not to the content of the offset.
  2. Over-arching design choices for MENTS
    • MENTS is built for nested tagging with tree-shaped tag hierarchies. Each tag has a parent (except for the top-level tags) and can have children.
    • MENTS is designed as a database that serves many simultaneous queries across multiple user sessions
    • The file format is designed to be human-readable in a hex dump
    • To my knowledge, no existing solution accomplishes these points
  3. MENTS is designed for quick lookup of child tags in a multi-user-session database. This means hash table lookup (O(1) average complexity) as opposed to tree traversal or linear traversal (O(N) complexity).
  4. File Header (32 Bytes, or 2 hex-lines)
    1. MENTS magic number in unicode
    2. Version number
    3. Tags quantity denoter
  5. Tags Metadata (4a)
    1. Tags Metadata = {Tag metadata individual length indicator, Tag meta/ipsodata individual offset indicator, Tag name}
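
To make the 32-byte header concrete, here is a sketch that packs and unpacks one with Python's `struct` module. The field widths, byte order, and padding are all assumptions made for illustration (the outline above names the fields but not their sizes): 8 bytes for the magic number, 4 for the version, 4 for the tag count, and 16 reserved bytes. MENTS itself is written in C, and its real layout may differ.

```python
import struct

# Assumed layout, NOT taken from the MENTS format: little-endian,
# 8-byte magic, 4-byte version, 4-byte tag count, 16 reserved bytes.
HEADER_FORMAT = "<8sII16x"
MAGIC = "MENTS".encode("utf-8")

def pack_header(version, tag_count):
    """Return a 32-byte header; the magic is null-padded to 8 bytes."""
    return struct.pack(HEADER_FORMAT, MAGIC, version, tag_count)

def unpack_header(raw):
    """Validate the magic number and return (version, tag_count)."""
    magic, version, tag_count = struct.unpack(HEADER_FORMAT, raw)
    if magic.rstrip(b"\x00") != MAGIC:
        raise ValueError("not a MENTS file")
    return version, tag_count

header = pack_header(version=1, tag_count=11)

# Print the header as two hex-lines of 16 bytes each.
for row in (header[:16], header[16:]):
    print(row.hex(" "))
```

Fixed-width fields keep the header at exactly two hex-lines, which fits the goal of a format that is readable in a hex dump.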

LENTS (Local ENTS) - Planned Version


LENTS (Local ENTS) is an implementation of ENTS written with individual researchers and small research teams in mind. It will likely be a wrapper around Refsplitr and PartiQL/DynamoDB Local, which allow for more customizable queries.

  • Refsplitr - “refsplitr is a package designed to assist researchers dealing with bibliometric data by providing tools for author name disambiguation, author georeferencing, and coauthorship network mapping using data from the Web of Science” [1]
    • Author disambiguation
    • Support for multiple authors per file
    • Author/Institution timeline mapping
    • Institution disambiguation
  • PartiQL - “An expressive, SQL-compatible query language giving access to relational, semi-structured, and nested data.” [2]
    • allows for more custom queries, including queries with the additional information above
  • Potential Implementations: Metadata Extraction Tools
    • ExifTool by Phil Harvey - “ExifTool is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files.”
    • Apache Tika - “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.”
  • Visualization tools

  1. Stevens, Forrest. “Refsplitr: Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data.” Journal of Open Source Software, n.d.

  2. https://partiql.org/index.html
