Saturday, March 29, 2008

String Tokenizer for Javascript


This small class can easily parse a string and generate different kinds of tokens. It's very simple and straightforward. It can serve as a base for other string-parsing scripts, such as templating engines, custom language interpreters, and more.

jQuery plugin vs standalone

When the script runs, it generates the class and, if jQuery is detected, saves it at $.tokenizer.
Otherwise, the class is saved at (window.)Tokenizer.
Note that the script doesn't need jQuery at all; the option exists only as a convenience for jQuery developers.
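
The detection rule above can be sketched as follows. This is only an illustration of the idea, not the script's actual source; registerTokenizer, globalObject and FakeTokenizer are hypothetical names.

```javascript
// Hypothetical helper showing the registration rule described above.
// `globalObject` stands in for window; `Tokenizer` is whatever class was built.
function registerTokenizer(globalObject, Tokenizer) {
  if (globalObject.jQuery) {
    // jQuery detected: expose the class as a plugin at $.tokenizer.
    globalObject.jQuery.tokenizer = Tokenizer;
    return '$.tokenizer';
  }
  // No jQuery: fall back to a plain global.
  globalObject.Tokenizer = Tokenizer;
  return 'Tokenizer';
}

function FakeTokenizer() {} // placeholder class for the demonstration

var withJQuery = { jQuery: function () {} };
var withoutJQuery = {};
var whereA = registerTokenizer(withJQuery, FakeTokenizer);    // '$.tokenizer'
var whereB = registerTokenizer(withoutJQuery, FakeTokenizer); // 'Tokenizer'
```

Either way the same class is exposed; only the place it is stored changes.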

How to use

The constructor takes two arguments; the second one is optional.
  • tokenizers
    This is a collection of strings/regexes that match the tokens.
    The regexes don't need to include capturing groups; they can, but the whole match is still considered the token.
    If you use a regex, it's important that you DON'T make it global.
    You can pass an array of tokenizers, or just one.
  • build
    This is a parsing function. It gets called for each token found, and also for the string between tokens. It should return the parsed token; this doesn't need to be a string, it can be an array, an object, etc.
    If no function is given, the tokens are the matched strings.
    The function receives 3 arguments:
    1. The string token that was matched.
    2. Whether it is a matched token or the string between 2 tokens (true for a real token, false for a plain string).
    3. The tokenizer that matched this string, or the one that skipped over this slice in the case of plain strings.
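
The post does not say why global regexes must be avoided. One likely reason (an assumption on my part) is that a global regex remembers where its last match ended, so reusing it on a fresh slice of text can silently miss a token. A small sketch of that standard JavaScript behaviour, with hypothetical variable names:

```javascript
var globalRe = /\$\w+/g; // global: exec() remembers where it stopped (lastIndex)
var plainRe = /\$\w+/;   // non-global: exec() always starts at index 0

var text = '$age';
var first = globalRe.exec(text);  // matches '$age', lastIndex becomes 4
var second = globalRe.exec(text); // null: the search resumed at index 4
var third = plainRe.exec(text);   // matches '$age'
var fourth = plainRe.exec(text);  // still matches '$age'
```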
As mentioned, build is called not only for each token found but also for the strings between tokens; use the second argument to tell them apart. After creating the tokenizer, call its .parse() method with the string, and it returns the array of tokens. You may prefer to do the actual work inside build and simply ignore the returned array.
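
To make the calling convention concrete, here is a minimal stand-in that follows the description above. MiniTokenizer is a hypothetical name and this is not the plugin's source. It also makes two assumptions of its own: empty plain strings are skipped, and trailing text after the last token is reported with null as the third argument.

```javascript
// Hypothetical stand-in for the real class, written only to illustrate
// the build callback. It is not the plugin's actual source.
function MiniTokenizer(tokenizers, build) {
  // Accept a single tokenizer or an array of them.
  this.tokenizers = Array.isArray(tokenizers) ? tokenizers : [tokenizers];
  // Without a build function, tokens are the matched strings themselves.
  this.build = build || function (src) { return src; };
}

MiniTokenizer.prototype.parse = function (text) {
  var tokens = [], pos = 0;
  while (pos < text.length) {
    var rest = text.slice(pos), best = null;
    // Find the tokenizer whose match starts earliest in the remaining text.
    this.tokenizers.forEach(function (t) {
      var index, match;
      if (typeof t === 'string') {
        index = rest.indexOf(t);
        match = t;
      } else {
        var m = t.exec(rest); // assumes a non-global regex, so the search starts at 0
        index = m ? m.index : -1;
        match = m ? m[0] : '';
      }
      if (index !== -1 && match.length > 0 && (!best || index < best.index)) {
        best = { index: index, match: match, tokenizer: t };
      }
    });
    if (!best) {
      // Trailing plain text: no tokenizer follows it, so null is passed (assumption).
      tokens.push(this.build(rest, false, null));
      break;
    }
    if (best.index > 0) {
      // Plain string between tokens, reported with the tokenizer that follows it.
      tokens.push(this.build(rest.slice(0, best.index), false, best.tokenizer));
    }
    tokens.push(this.build(best.match, true, best.tokenizer));
    pos += best.index + best.match.length;
  }
  return tokens;
};

// Record every call to build so the two kinds of slices are visible.
var calls = [];
var mini = new MiniTokenizer([/\$(\w+)/], function (src, real) {
  calls.push([src, real]);
  return real ? src.toUpperCase() : src;
});
var result = mini.parse('Hi $name, bye');
// calls:  [['Hi ', false], ['$name', true], [', bye', false]]
// result: ['Hi ', '$NAME', ', bye']
```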



var values = { name:'Joe', age:32, surname:'Smith' };
var tokenizer = new Tokenizer([
    /<%(\w+)%>/, /\$(\w+)/
 ],function( src, real, re ){
    // real tokens: swap the captured name for its value; plain strings pass through
    return real ? src.replace(re,function(all,name){
       return values[name];
    }) : src;
});
var tpl = '<%name%> $surname is $age years old.';
var tokens = tokenizer.parse(tpl);
document.body.innerHTML = tokens.join('');
CSV parser

var rows = [ ], row = rows[0] = [ ]; 
var csv = new Tokenizer( [',',';'],
  function( text, isSeparator ){
     if( isSeparator ){
         if( text == ';' ){//new row
             row = [ ];
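
The listing above is cut off after the new-row branch. Below is a sketch of how such a parser could be completed. It is my reconstruction rather than the original code, and it assumes that ';' starts a new row, ',' only separates cells, and plain text is a cell. SeparatorTokenizer is a hypothetical, string-only stand-in for the Tokenizer class so that the sketch runs on its own.

```javascript
// Hypothetical stand-in limited to single-character string separators.
function SeparatorTokenizer(separators, build) {
  this.separators = separators;
  this.build = build;
}
SeparatorTokenizer.prototype.parse = function (text) {
  var out = [], buffer = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (this.separators.indexOf(ch) !== -1) {
      // Flush the plain text collected so far, then report the separator.
      if (buffer) out.push(this.build(buffer, false));
      out.push(this.build(ch, true));
      buffer = '';
    } else {
      buffer += ch;
    }
  }
  if (buffer) out.push(this.build(buffer, false));
  return out;
};

var rows = [], row = rows[0] = [];
var csv = new SeparatorTokenizer([',', ';'], function (text, isSeparator) {
  if (isSeparator) {
    if (text == ';') { // new row
      row = [];
      rows.push(row);
    }
    // a ',' only separates cells, so there is nothing to store
  } else {
    row.push(text); // plain text between separators is a cell
  }
});
csv.parse('a,b;c,d');
// rows is now [['a', 'b'], ['c', 'd']]
```

Here the work happens inside build and the array returned by parse() is ignored, as suggested earlier in the post.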



Richard D. Worth said...

Ariel. This is great! It's going on my short list. Thanks for sharing.

dbruensicke said...

do you mean CSV instead of CVS?

Ariel Flesler said...

What was I thinking? Thanks for catching that. I made the mistake once and then repeated it over and over; I will fix it now.

Thanks again.

Miguel Ruiz Velasco S said...

Firebug complains that onEnd is not defined. Reading the code:
return new Tokenizer( tokenizers, onEnd, onFound );
In the code above, onEnd and onFound are not defined; changing them to doBuild makes it work.

Ariel Flesler said...

Right, thanks for spotting that. It was left over from a change in the last release.
It only happens when the constructor is called without 'new'.
I just fixed it on the trunk; it will be in the next release.
Thanks again.

olivier said...

Very interesting, though curiously the first example doesn't work for me, returning:

Joe Smith is 32undefined years old.

But, removing 'years old.' from tpl, or adding a <% %> at the end as follows:

var values = { firstname:'Joe', age:'32', surname:'Smith', fin:'' };
var tpl = 'guy <%firstname%> <%surname%> is <%age%> years old.<%fin%>';

makes it work properly.

Any hint?

Ariel Flesler said...

Ok, fixed the demo. Thanks for noticing.