Lexer and tokenizer for robots.txt in JavaScript
Introduction
I'm sharing how I made a lexer and tokenizer for robots.txt in JavaScript.
Reasons and motives
I'm using it in a Node.js environment, so it's server-side; it's part of another project I'm working on.
How it's done
robots.txt files contain one directive per line, in the form of
Key: value
Take for instance
User-agent: Googlebot/{version}
Disallow: {path}
Disallow: {path}
Path is the disallowed location. It may include wildcards such as "*", or be a bare "/" that matches everything; it's not a good idea to consume these values raw without validation.
And the version number is ignored, to make targeting bots by name easier, I guess.
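For illustration, here's a minimal sketch of stripping that version suffix before matching (normalizeUserAgent is a hypothetical name of my own, not from the project):

// Strip the "/{version}" suffix from a user-agent value so
// "Googlebot/2.1" and "Googlebot" are treated as the same bot.
function normalizeUserAgent(value) {
  return value.split('/')[0].trim().toLowerCase();
}

normalizeUserAgent('Googlebot/2.1'); // -> 'googlebot'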
Steps taken by the lexer
Splitting the text into an array containing the robots.txt data line by line. Given an input looking like this:
User-agent: Googlebot/{version}
Disallow: {path}
Disallow: {path}
It may return something like this:
['User-agent: Googlebot/{version}',
'Disallow: {path}',
'Disallow: {path}']
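In code, that first step can be a one-liner (a sketch; input is assumed to hold the raw file contents):

// Split the raw robots.txt text into individual lines.
// /\r?\n/ covers both LF and CRLF line endings.
const lines = input.split(/\r?\n/);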
Almost there. After that, the next step is to remove comments from each line using a regex.
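Something along these lines works (the exact regex is my own; the project's may differ):

// Everything from "#" to the end of the line is a comment.
const withoutComment = line.replace(/#.*$/, '');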
Afterwards, you should trim whitespace and split each line into a key/value pair.
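Here is a sketch of that final step (tokenize is a hypothetical name; it splits on the first ":" only, so URLs in values stay intact):

// Trim whitespace and split on the first ":" into a key/value token.
// Later colons (e.g. in "http://...") must stay in the value.
function tokenize(line) {
  const index = line.indexOf(':');
  if (index === -1) return null; // not a directive
  const type = line.slice(0, index).trim().toLowerCase();
  const value = line.slice(index + 1).trim();
  return { type, value };
}

Logging the resulting array of { type, value } tokens with console.table gives output like the table below.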
Example output
┌─────────┬──────────────┬─────────────────────────────────────────────┐
│ (index) │ type │ value │
├─────────┼──────────────┼─────────────────────────────────────────────┤
│ 0 │ 'user-agent' │ 'slurp' │
│ 1 │ 'disallow' │ '' │
│ 2 │ 'user-agent' │ 'seznambot' │
│ 3 │ 'disallow' │ '' │
│ 4 │ 'user-agent' │ 'mediapartners-google' │
│ 5 │ 'disallow' │ '' │
│ 6 │ 'user-agent' │ 'naverbot' │
│ 7 │ 'disallow' │ '' │
│ 8 │ 'user-agent' │ 'msnbot' │
│ 9 │ 'disallow' │ '' │
│ 10 │ 'user-agent' │ 'googlebot' │
│ 11 │ 'disallow' │ '' │
│ 12 │ 'user-agent' │ 'bingbot' │
│ 13 │ 'disallow' │ '' │
│ 14 │ 'user-agent' │ 'baiduspider' │
│ 15 │ 'user-agent' │ '*' │
│ 16 │ 'disallow' │ "'/food/'" │
│ 17 │ 'sitemap' │ 'http://swanlivesvblogsxot.com/sitemap.xml' │
└─────────┴──────────────┴─────────────────────────────────────────────┘
Source code