Lexer and tokenizer for robots.txt in JavaScript

Introduction

I'm sharing how I made a lexer and tokenizer for robots.txt in JavaScript.

Reasons and motives

I'm using it in the Node.js environment, of course, so it's server-side. It's part of another project I'm working on.

How it's done

robots.txt files contain one directive per line, in the form of:

Key: value

Take, for instance:

User-agent: Googlebot/{version}
Disallow: {path}

The path is the disallowed location. It may include wildcards, i.e. "*" or "/"; it's not a good idea to consume these values raw, without validation.
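As a rough illustration, here's a hypothetical check (the rule below is my own assumption, not part of the original lexer):

// Hypothetical rule: only accept values that start with "/" or the "*" wildcard.
function isValidPath(path) {
  return typeof path === 'string' && /^[\/*]/.test(path);
}

isValidPath('/food/'); // true
isValidPath('*');      // true
isValidPath('food');   // false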

And the version number is ignored, to make targeting bots easier, I guess.
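Stripping that version suffix could look like this (a sketch; the original code may handle it differently):

// "Googlebot/2.1" -> "googlebot"
function normalizeUserAgent(value) {
  return value.split('/')[0].toLowerCase();
}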

Steps taken by the lexer
The first step is splitting the text into an array containing the robots.txt data line by line. Given an input looking like this:

User-agent: Googlebot/{version}
Disallow: {path}

it may return something like this:

['User-agent: Googlebot/{version}',
 'Disallow: {path}']
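A minimal sketch of that step, assuming the input arrives as one plain string:

const input = 'User-agent: Googlebot/{version}\nDisallow: {path}';

// Split on both Unix and Windows line endings.
const lines = input.split(/\r?\n/);

console.log(lines);
// [ 'User-agent: Googlebot/{version}', 'Disallow: {path}' ]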

Almost there. The next step is to remove comments from each line using a regex.
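In robots.txt, everything from a "#" to the end of the line is a comment, so a sketch of that step (not necessarily the exact regex used) is:

const line = 'Disallow: /food/ # keep bots out of the food section';

// Drop everything from "#" to the end of the line.
const withoutComment = line.replace(/#.*$/, '');
// -> 'Disallow: /food/ '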

Afterwards, you should trim whitespace and split each line into a key and a value.
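A sketch of that step. Note that it splits on the first ":" only, since values such as sitemap URLs contain colons themselves:

const line = '  Sitemap: http://swanlivesvblogsxot.com/sitemap.xml  ';

// Split on the first ":" only, then trim both halves.
const index = line.indexOf(':');
const type = line.slice(0, index).trim().toLowerCase();
const value = line.slice(index + 1).trim();

console.log({ type, value });
// { type: 'sitemap', value: 'http://swanlivesvblogsxot.com/sitemap.xml' }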

Example output


┌─────────┬──────────────┬─────────────────────────────────────────────┐
│ (index) │     type     │                    value                    │
├─────────┼──────────────┼─────────────────────────────────────────────┤
│    0    │ 'user-agent' │                   'slurp'                   │
│    1    │  'disallow'  │                     ''                      │
│    2    │ 'user-agent' │                 'seznambot'                 │
│    3    │  'disallow'  │                     ''                      │
│    4    │ 'user-agent' │           'mediapartners-google'            │
│    5    │  'disallow'  │                     ''                      │
│    6    │ 'user-agent' │                 'naverbot'                  │
│    7    │  'disallow'  │                     ''                      │
│    8    │ 'user-agent' │                  'msnbot'                   │
│    9    │  'disallow'  │                     ''                      │
│   10    │ 'user-agent' │                 'googlebot'                 │
│   11    │  'disallow'  │                     ''                      │
│   12    │ 'user-agent' │                  'bingbot'                  │
│   13    │  'disallow'  │                     ''                      │
│   14    │ 'user-agent' │                'baiduspider'                │
│   15    │ 'user-agent' │                     '*'                     │
│   16    │  'disallow'  │                 "'/food/'"                  │
│   17    │  'sitemap'   │ 'http://swanlivesvblogsxot.com/sitemap.xml' │
└─────────┴──────────────┴─────────────────────────────────────────────┘
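For reference, an input along these lines (reconstructed from the table above, so treat it as an approximation) would produce that output:

User-agent: Slurp
Disallow:
User-agent: Seznambot
Disallow:
User-agent: Mediapartners-Google
Disallow:
User-agent: Naverbot
Disallow:
User-agent: Msnbot
Disallow:
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Baiduspider
User-agent: *
Disallow: '/food/'
Sitemap: http://swanlivesvblogsxot.com/sitemap.xml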

Source code
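The full source isn't reproduced here, but a minimal end-to-end sketch of the lexer as described above might look like this (the real implementation may differ):

// A sketch of the lexer; assumes `input` is the raw robots.txt text.
function lex(input) {
  const tokens = [];
  for (const rawLine of input.split(/\r?\n/)) {
    // Remove comments, then surrounding whitespace.
    const line = rawLine.replace(/#.*$/, '').trim();
    if (line === '') continue;

    // Split on the first ":" only, so sitemap URLs keep their colons.
    const index = line.indexOf(':');
    if (index === -1) continue; // skip malformed lines

    const type = line.slice(0, index).trim().toLowerCase();
    let value = line.slice(index + 1).trim();
    if (type === 'user-agent') {
      // Version numbers are ignored: "Googlebot/2.1" -> "googlebot".
      value = value.split('/')[0].toLowerCase();
    }
    tokens.push({ type, value });
  }
  return tokens;
}

console.table(lex('User-agent: Googlebot/2.1\nDisallow: /food/ # example'));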
