Diego Cabello


ENTS (Extendable Nested Tagging Schema)


Words: 1339

Draft: < 3 (Most recent)

Github

One of the best pieces of advice I’ve ever gotten was “follow whatever you are interested in”.

About a year ago I started using Obsidian, but I was bothered by its limited tagging system and the lack of native nested tags. There were add-ons for it, but I came to find I preferred more minimalistic CLI/UNIX-inspired solutions anyway. This matters for researchers who have to comb through a lot of local files trying to find connections between disparate fields. Common file systems (ext4, NTFS, exFAT, APFS) treat tags as flat labels, which limits hierarchical tag organization and nuanced queries. So I made the Extendable Nested Tagging Schema, or ENTS.

ENTS is an implementable YAML and command schema. When implemented, the “S” in “ENTS” stands for “System”. Currently it comes in two implementations, MENTS (Metal ENTS) and LENTS (Local ENTS).

ENTS Schema


Tags Format in YAML

Example:

Organic Chemistry:
  - Medicinal Chemistry
  - Polymer Chemistry:
      - Plastic Materials
      - Carbon Nanomaterials:
          - Carbon Nanotubes
          - Graphene
  - Petrochemistry
Inorganic Chemistry:
  - Coordination Chemistry
  - Organometallic Chemistry
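
To make the nesting concrete, here is a minimal Python sketch that walks the structure above and records each tag's parent. It assumes the YAML has already been loaded into plain dicts and lists, roughly what a YAML loader would return for this example. The names TAGS, build_parent_map, and ancestry are hypothetical and not part of ENTS.

```python
# The example tag tree, written as the dicts/lists a YAML loader would likely produce.
TAGS = {
    "Organic Chemistry": [
        "Medicinal Chemistry",
        {"Polymer Chemistry": [
            "Plastic Materials",
            {"Carbon Nanomaterials": ["Carbon Nanotubes", "Graphene"]},
        ]},
        "Petrochemistry",
    ],
    "Inorganic Chemistry": [
        "Coordination Chemistry",
        "Organometallic Chemistry",
    ],
}


def build_parent_map(tree):
    """Return {tag: parent or None} for every tag in the nested structure."""
    parents = {}

    def visit(node, parent):
        if isinstance(node, str):        # a leaf tag with no children
            parents[node] = parent
        elif isinstance(node, dict):     # one or more tags that have children
            for name, children in node.items():
                parents[name] = parent
                for child in children or []:
                    visit(child, name)
        else:
            raise TypeError(f"unexpected node: {node!r}")

    visit(tree, None)
    return parents


def ancestry(tag, parents):
    """List a tag's ancestors, nearest parent first, ending at a top-level tag."""
    chain = []
    while parents[tag] is not None:
        tag = parents[tag]
        chain.append(tag)
    return chain


PARENTS = build_parent_map(TAGS)
```

With this map, ancestry("Carbon Nanotubes", PARENTS) walks up through Carbon Nanomaterials and Polymer Chemistry to Organic Chemistry, which is the information a nested query needs.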

Commands

A generic xents command will be used here as an example implementation.

Usage: xents <command> <operator> <arguments>

ENTS commands use a one-to-many mapping, not many-to-one. This is chosen so that the first argument, of one type (for example a tag), can be followed by many arguments that all share a second type (for example files).

process

Takes a YAML file and processes it into a datafile. The output varies by implementation.

  • Usage: xents process file.yaml
  • Notes: no additional operators or arguments
  • Alias: parse

tagtofiles

Usage: xents tagtofiles <add|remove|show> <arguments>*

Description: Manages relationships between one tag and many files

Alias: ttf

  • add
    • Usage: xents tagtofiles add <tag> <files>*
    • Description: creates links from one tag to ≥ 1 files
    • Alias: assign
  • remove
    • Usage: xents tagtofiles remove <tag> <files>*
    • Description: removes links from one tag to ≥ 1 files
    • Alias: rm
  • show
    • Usage: xents tagtofiles show <tags>*
    • Description: shows all files linked to each tag given. Can take more than one tag as input

filetotags

Usage: xents filetotags <add|remove|show> <arguments>*

Description: Manages relationships between one file and many tags

Alias: ftt

  • add
    • Usage: xents filetotags add <file> <tags>*
    • Description: creates links from one file to ≥ 1 tags
    • Alias: assign
  • remove
    • Usage: xents filetotags remove <file> <tags>*
    • Description: removes links from one file to ≥ 1 tags
    • Alias: rm
  • show
    • Usage: xents filetotags show <files>*
    • Description: shows all tags linked to each file given. Can take more than one file as input
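
The tagtofiles and filetotags commands are two views of the same set of links. A minimal in-memory Python sketch of that idea follows. The LinkStore class and its method names are hypothetical stand-ins, and a real implementation would persist the links in its own datafile rather than in dictionaries.

```python
from collections import defaultdict


class LinkStore:
    """In-memory tag <-> file links, indexed in both directions."""

    def __init__(self):
        self.files_by_tag = defaultdict(set)
        self.tags_by_file = defaultdict(set)

    def _link(self, tag, file):
        self.files_by_tag[tag].add(file)
        self.tags_by_file[file].add(tag)

    def _unlink(self, tag, file):
        self.files_by_tag[tag].discard(file)
        self.tags_by_file[file].discard(tag)

    # tagtofiles: one tag, many files
    def tagtofiles_add(self, tag, *files):
        for file in files:
            self._link(tag, file)

    def tagtofiles_remove(self, tag, *files):
        for file in files:
            self._unlink(tag, file)

    def tagtofiles_show(self, *tags):
        return {tag: sorted(self.files_by_tag.get(tag, ())) for tag in tags}

    # filetotags: one file, many tags
    def filetotags_add(self, file, *tags):
        for tag in tags:
            self._link(tag, file)

    def filetotags_remove(self, file, *tags):
        for tag in tags:
            self._unlink(tag, file)

    def filetotags_show(self, *files):
        return {file: sorted(self.tags_by_file.get(file, ())) for file in files}


store = LinkStore()
store.tagtofiles_add("Graphene", "a.pdf", "b.pdf")
store.filetotags_add("a.pdf", "Petrochemistry", "Polymer Chemistry")
store.filetotags_remove("a.pdf", "Polymer Chemistry")
```

Both command families write to the same pair of indexes, so a link added through one is visible through the other.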

filter

Usage: xents filter [-flags] <tags>*

Description: Returns files associated with tags. This is the main command of ENTS and is where the nesting comes into play, unlike the tagtofiles command, which ignores nesting.

  • Nested:
    • Usage: xents filter <tags>
    • Description: the default mode.
    • Example: xents filter "organic chemistry" will return everything that is tagged with something within organic chemistry. If a file is tagged only with “carbon nanotubes”, it will still be returned, because “carbon nanotubes” is a sub-tag of “organic chemistry”.
  • Explicit -e
    • Usage: xents filter -e <tags>
    • Description: returns a file only if it is explicitly tagged with the given tag
    • Example: xents filter -e "organic chemistry" will return a file tagged with both “organic chemistry” and “carbon nanotubes”, but not one tagged with just “carbon nanotubes”.
    • Note: Uses the same implementation as xents tagtofiles show <tags>* and is included here as the -e flag for mnemonic purposes
  • Traverse Down by Count -td
    • Usage: xents filter -td <count> <tags>
    • Description: Returns files only if explicitly tagged with a tag at most <count> levels below the query tag. The query tag itself counts as level 0.
    • Example: xents filter -td 2 "organic chemistry" will return everything explicitly tagged with any tag between 0 and 2 layers down from “organic chemistry”. In this example, files tagged with “carbon nanomaterials” will be returned but files tagged with only “carbon nanotubes” will not.
    • Note: Can be combined with Traverse Up by Count: -tu <count> -td <count>. For example, xents filter -tu 1 -td 2 <tags> will traverse one node up the tree, then return everything recursively down two layers from that.
  • Traverse Up by Count -tu
    • Usage: xents filter -tu <count> <tags>
    • Description: Traverses <count> nodes up the tag tree from the query tag, then returns files from that node down. Recursive by default.
    • Example: xents filter -tu 1 "organic chemistry" will traverse one node up the tag tree, then return everything recursively down from that
    • Note: Can be combined with Traverse Down by Count: -tu <count> -td <count>
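
The four filter modes can be sketched in a few lines of Python. This is an illustration under assumptions, not the MENTS code: CHILDREN, FILE_TAGS, and the function names are hypothetical, the file names are made up, and stopping at the top of the tree when -tu overshoots is a guess at behavior the schema does not specify.

```python
# Part of the example tag tree, as parent -> children.
CHILDREN = {
    "organic chemistry": ["medicinal chemistry", "polymer chemistry", "petrochemistry"],
    "polymer chemistry": ["plastic materials", "carbon nanomaterials"],
    "carbon nanomaterials": ["carbon nanotubes", "graphene"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}

# Hypothetical explicit links: file -> tags it is directly tagged with.
FILE_TAGS = {
    "overview.pdf": {"organic chemistry", "carbon nanotubes"},
    "cnt.pdf": {"carbon nanotubes"},
    "nano.pdf": {"carbon nanomaterials"},
    "plastics.pdf": {"plastic materials"},
}


def tags_within(tag, max_depth=None):
    """Return the tag plus its descendants, at most max_depth levels down (None = all)."""
    found, frontier, depth = {tag}, [tag], 0
    while frontier and (max_depth is None or depth < max_depth):
        frontier = [child for t in frontier for child in CHILDREN.get(t, [])]
        found.update(frontier)
        depth += 1
    return found


def climb(tag, levels):
    """Move up to `levels` nodes toward the root, stopping early at a top-level tag."""
    for _ in range(levels):
        if tag not in PARENT:
            break
        tag = PARENT[tag]
    return tag


def filter_files(tag, explicit=False, up=0, down=None):
    """Nested by default; explicit mirrors -e, up mirrors -tu, down mirrors -td."""
    wanted = {tag} if explicit else tags_within(climb(tag, up), down)
    return sorted(f for f, tags in FILE_TAGS.items() if tags & wanted)
```

Here filter_files("organic chemistry", down=2) includes nano.pdf but not cnt.pdf, matching the -td 2 example above, while filter_files("organic chemistry", explicit=True) returns only overview.pdf.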

MENTS (Metal ENTS)


MENTS is a close-to-the-metal implementation of ENTS written in C with a custom binary file format. It is the most hackerish version of ENTS, has the smallest database size, and was made as a low-level learning project. It is not very extendable and the number of tags it can hold is limited, so LENTS is also an option.

MENTS usage

process

The output of ments process tags.yaml is always written to a hidden .ments dot-file. This output location cannot be changed, for the sake of parsing.

The MENTS file format

  1. Terminology and stylistic choices
    • Some set theory will be used in combination with colloquial English because it is the best way to communicate some things clearly and unambiguously
    • Hex-line: refers to a standard 16-byte row in a hex dump
    • Metadata: a common term meaning “data about the data”, used here in that sense.
    • Ipsodata: contrasted with metadata; “data that is the data itself”. Commonly used English terms for “the data itself”, such as “core data”, “primary data”, “source data”, and “raw data”, do not capture what I mean here: the data itself, as opposed to the data about the data. It uses the Latin “ipso”, meaning “itself”, as a prefix.
    • Offset: when “offset” is used here, it refers only to the length of the offset and not to the content at the offset.
  2. Over-arching design choices for MENTS
    • MENTS is a directed acyclic graph structure where hierarchically organized tags (nodes with parent-child relationships) map to files (destination nodes)
    • The file format is designed for human-readability in a hex dump
    • The file format is structured as a two-level index to two offset lookup tables (tags & file->tags)
  3. File Header (32 Bytes, or 2 hex-lines)
    1. MENTS magic number in unicode
    2. Version number
    3. Offsets
      a. Start of tags ipsodata
      b. Start of file-to-tags metadata
      c. Start of file-to-tags ipsodata
  4. Tags Data

Each tag's ipsodata can vary significantly in length, so an offset table is used to save a lot of file space at a minimal cost in lookup time. The attributes within the ipsodata can also vary in length from each other, so a “tag individual ipsodata part length” field is needed in the metadata. The tag ipsodata offset is the distance measured from (3.3.a). The tag UUID is not necessary in the tag ipsodata, but is included for debugging purposes.

  1. Tags Metadata = {∀ Tag Metadata}
    • Tag Metadata = {Tag UUID, Tag individual ipsodata part length, Tag ipsodata offset}
  2. Tags Ipsodata = {∀ Tag Ipsodata}
    • Tag Ipsodata = {Tag UUID, Tag name, Tag Ancestry UUIDs, Tag Children UUIDs}
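
The offset-table idea can be illustrated with a simplified Python sketch: fixed-size metadata entries point into a blob of variable-length records, so looking up the n-th record takes one table read and one slice. The field widths, the use of small integers as stand-ins for UUIDs, and the records holding only a tag name are all assumptions for illustration. The actual MENTS layout is not specified at this level of detail here.

```python
import struct

# Assumed entry layout: id, record length, record offset (three little-endian uint32).
ENTRY = struct.Struct("<III")


def pack_tags(names):
    """Return (metadata_table, ipsodata_blob) for a list of tag names."""
    table, blob = bytearray(), bytearray()
    for tag_id, name in enumerate(names):
        record = name.encode("utf-8")
        # The offset is measured from the start of the ipsodata blob.
        table += ENTRY.pack(tag_id, len(record), len(blob))
        blob += record
    return bytes(table), bytes(blob)


def read_tag(table, blob, index):
    """Fetch the index-th record without scanning the records before it."""
    tag_id, length, offset = ENTRY.unpack_from(table, index * ENTRY.size)
    return tag_id, blob[offset:offset + length].decode("utf-8")


table, blob = pack_tags(["Graphene", "Organic Chemistry", "Petrochemistry"])
```

Each table entry costs 12 bytes in this sketch, while the records themselves take only as many bytes as their content needs, which is the space-for-lookup trade described above.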
  5. File-to-tags Relationship Data

The same heuristics about length variation and UUID-inclusion from (4) apply here. File-to-tags metadata starts at (3.3.b) and file ipsodata offset is the distance measured from (3.3.c).

  1. Files Metadata = {∀ File Metadata}
    • File Metadata = {File UUID, File individual ipsodata part length, File ipsodata offset}
  2. Files Ipsodata = {∀ File Ipsodata}
    • File Ipsodata = {File UUID, Tags UUIDs}
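
As a rough picture of the 32-byte file header from (3), here is a Python sketch that packs and reads a magic number, a version, and the three offsets. Only the total size and the list of fields come from the description above. The field widths, byte order, trailing padding, and the placeholder magic bytes are assumptions, and pack_header and unpack_header are hypothetical names.

```python
import struct

# Assumed layout: 8-byte magic, uint32 version, three uint32 offsets, 8 bytes of padding.
HEADER = struct.Struct("<8sI3I8x")
MAGIC = b"MENTS"  # placeholder; the real magic bytes are not given in this post


def pack_header(version, tags_ipsodata, file_tags_metadata, file_tags_ipsodata):
    """Build a 32-byte header (two hex-lines) holding the three section offsets."""
    return HEADER.pack(MAGIC, version, tags_ipsodata, file_tags_metadata, file_tags_ipsodata)


def unpack_header(data):
    """Validate the magic number and return the header fields as a dict."""
    magic, version, tags_ipso, ft_meta, ft_ipso = HEADER.unpack_from(data)
    if magic.rstrip(b"\0") != MAGIC:
        raise ValueError("not a MENTS file")
    return {
        "version": version,
        "tags_ipsodata": tags_ipso,
        "file_tags_metadata": ft_meta,
        "file_tags_ipsodata": ft_ipso,
    }


header = pack_header(1, 128, 512, 640)
fields = unpack_header(header)
```

A reader would then use the three offsets to jump straight to the tags ipsodata, the file-to-tags metadata, and the file-to-tags ipsodata.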

LENTS (Local ENTS) - Planned Version


LENTS (Local ENTS) is an implementation of ENTS written with individual researchers and small research teams in mind. It will likely be a wrapper around Refsplitr and PartiQL/DynamoDB Local, which allows for more customizable queries.

  • Refsplitr - “refsplitr is a package designed to assist researchers dealing with bibliometric data by providing tools for author name disambiguation, author georeferencing, and coauthorship network mapping using data from the Web of Science”[^1]
    • Author disambiguation
    • Support for multiple authors per file
    • Author/Institution timeline mapping
    • Institution disambiguation
  • PartiQL - “An expressive, SQL-compatible query language giving access to relational, semi-structured, and nested data.”[^2]
    • allows for more custom queries, including queries with the additional information above
  • Potential Implementations: Metadata Extraction Tools
    • ExifTool by Phil Harvey - “ExifTool is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files.”
    • Apache Tika - “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.”
  • Visualization tools

  1. Stevens, Forrest. “Refsplitr: Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data.” Journal of Open Source Software, n.d.

