Lexer and tokenizer for robots.txt in JavaScript
Introduction
I'm sharing how I made a lexer and tokenizer for robots.txt in JavaScript.
Reasons and motives
I'm using it in a Node.js environment, so it's server-side; it's part of another project I'm working on.
How it's done
robots.txt files contain one directive per line, in the form of
Key: value
Take for instance
User-agent: Googlebot/{version}
Disallow: {path}
Disallow: {path}
Path is the disallowed location. It may include wildcards such as "*", or be a bare "/" that matches everything; it's not a good idea to consume these values raw without validation.
And the version number is ignored, to make targeting bots by name easier, I guess.
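For illustration, here's a minimal sketch of stripping that version suffix before matching (normalizeUserAgent is a hypothetical name of my own, not from the project):

// Strip the "/{version}" suffix from a user-agent value so
// "Googlebot/2.1" and "Googlebot" are treated as the same bot.
function normalizeUserAgent(value) {
  return value.split('/')[0].trim().toLowerCase();
}

normalizeUserAgent('Googlebot/2.1'); // -> 'googlebot'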
Steps taken by the lexer
Splitting the text into an array containing the robots.txt data line by line. Given an input looking like this:
User-agent: Googlebot/{version}
Disallow: {path}
Disallow: {path}
It may return something like this:
['User-agent: Googlebot/{version}',
'Disallow: {path}',
'Disallow: {path}']
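In code, that first step can be a one-liner (a sketch; input is assumed to hold the raw file contents):

// Split the raw robots.txt text into individual lines.
// /\r?\n/ covers both LF and CRLF line endings.
const lines = input.split(/\r?\n/);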
Almost there. After that, the next step is to remove comments from each line using a regex.
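Something along these lines works (the exact regex is my own; the project's may differ):

// Everything from "#" to the end of the line is a comment.
const withoutComment = line.replace(/#.*$/, '');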
Afterwards, you should trim whitespace and split each line into a key/value pair.
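Here is a sketch of that final step (tokenize is a hypothetical name; it splits on the first ":" only, so URLs in values stay intact):

// Trim whitespace and split on the first ":" into a key/value token.
// Later colons (e.g. in "http://...") must stay in the value.
function tokenize(line) {
  const index = line.indexOf(':');
  if (index === -1) return null; // not a directive
  const type = line.slice(0, index).trim().toLowerCase();
  const value = line.slice(index + 1).trim();
  return { type, value };
}

Logging the resulting array of { type, value } tokens with console.table gives output like the table below.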
Example output
┌─────────┬──────────────┬─────────────────────────────────────────────┐
│ (index) │ type │ value │
├─────────┼──────────────┼─────────────────────────────────────────────┤
│ 0 │ 'user-agent' │ 'slurp' │
│ 1 │ 'disallow' │ '' │
│ 2 │ 'user-agent' │ 'seznambot' │
│ 3 │ 'disallow' │ '' │
│ 4 │ 'user-agent' │ 'mediapartners-google' │
│ 5 │ 'disallow' │ '' │
│ 6 │ 'user-agent' │ 'naverbot' │
│ 7 │ 'disallow' │ '' │
│ 8 │ 'user-agent' │ 'msnbot' │
│ 9 │ 'disallow' │ '' │
│ 10 │ 'user-agent' │ 'googlebot' │
│ 11 │ 'disallow' │ '' │
│ 12 │ 'user-agent' │ 'bingbot' │
│ 13 │ 'disallow' │ '' │
│ 14 │ 'user-agent' │ 'baiduspider' │
│ 15 │ 'user-agent' │ '*' │
│ 16 │ 'disallow' │ "'/food/'" │
│ 17 │ 'sitemap' │ 'http://swanlivesvblogsxot.com/sitemap.xml' │
└─────────┴──────────────┴─────────────────────────────────────────────┘
Source code