class HumanQL::QueryParser

Human-friendly, lenient query parser. Parses an arbitrary input string and outputs an abstract syntax tree (AST), which uses Ruby arrays as S-expressions.

Supported Syntax Summary

The following applies to the default configuration. In the table below, input string variations are shown at left, separated by ','; the output AST is shown at right.

a                        --> 'a'
"a b c"                  --> [ :phrase, 'a', 'b', 'c' ]
a b c                    --> [ :and, 'a', 'b', 'c' ]
a OR b, a|b              --> [ :or, 'a', 'b' ]
a AND b, a&b             --> [ :and, 'a', 'b' ]
a b|c                    --> [ :and, 'a', [:or, 'b', 'c'] ]
(a b) OR (c d)           --> [ :or, [:and, 'a', 'b'], [:and, 'c', 'd'] ]
NOT expr, -expr          --> [ :not, expr ]
SCOPE:expr, SCOPE : expr --> [ 'SCOPE', expr ]

Where 'a', 'b', and 'c' represent query terms, expr is any sub-expression, and SCOPE is a configured scope name (see scopes=).

The AST output from parse may have various no-ops and redundancies. Run it through a TreeNormalizer to avoid seeing or needing to handle these cases.
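Since the AST is just nested Ruby arrays, it can be inspected or transformed with ordinary array operations. A minimal sketch (the render helper here is illustrative, not part of the library):

```ruby
# An AST as produced for the input "a b|c" under default options.
ast = [:and, 'a', [:or, 'b', 'c']]

# Recursively render the S-expression back to an infix-style string,
# parenthesizing nested sub-expressions.
def render(node)
  return node unless node.is_a?(Array)
  op, *args = node
  args.map { |a| a.is_a?(Array) ? "(#{render(a)})" : a }
      .join(" #{op.to_s.upcase} ")
end

render(ast)  # => "a AND (b OR c)"
```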

Customization

The lexing and token-matching patterns, as well as other attributes used in the parser, may be adjusted via constructor options or attribute writer methods. Many of these attributes may be either String constants or Regexp patterns supporting multiple values as needed. Some features may be disabled by setting these values to nil (e.g. match no tokens). While accessors are defined, internally the instance variables are accessed directly for speed. Tests show this is as fast as using constants (which would be harder to modify) and faster than reader method calls.

Implementation Notes

The parser implementation adapts the infix precedence handling and operator stack of the Shunting Yard algorithm, originally described by Edsger Dijkstra. The default_op and precedence attributes control the handling of explicit or implied infix operators.
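The core idea can be sketched independently of the library: pending operators are held on a stack and popped to the output whenever an operator of lower or equal precedence arrives. A standalone sketch with illustrative precedence values (note the real parser also flattens same-operator chains to n-ary forms like [:and, 'a', 'b', 'c']):

```ruby
PREC = { or: 1, and: 2 }  # illustrative values only

# Convert a flat list of terms and binary operator Symbols into a
# nested S-expression, per Dijkstra's shunting yard.
def to_ast(tokens)
  terms = []  # output stack of terms / sub-trees
  ops   = []  # operator stack
  reduce = lambda do
    op  = ops.pop
    rhs = terms.pop
    lhs = terms.pop
    terms.push([op, lhs, rhs])
  end
  tokens.each do |t|
    if t.is_a?(Symbol)
      # Pop higher- or equal-precedence operators before pushing t.
      reduce.call while ops.any? && PREC[ops.last] >= PREC[t]
      ops.push(t)
    else
      terms.push(t)
    end
  end
  reduce.call while ops.any?
  terms.pop
end

to_ast(['a', :or, 'b', :and, 'c'])  # => [:or, 'a', [:and, 'b', 'c']]
```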

Constants

DEFAULT_PRECEDENCE

Default precedence of supported operators.

NSP

String pattern for Unicode non-spaces

SP

String pattern for Unicode spaces

SPACES

Regexp matching one or more Unicode spaces

Attributes

and_token[RW]

AND operator token pattern. Should match the entire token, using the '\A' and '\z' anchors for beginning and end of string. Default: Pattern matching complete tokens 'AND', 'and', or '&'
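The \A and \z anchors ensure the pattern matches only when the operator is the whole token. Using the default pattern from the constructor below:

```ruby
and_token = /\A(AND|\&)\z/i  # default shown in the constructor

and_token.match?('AND')    # => true
and_token.match?('and')    # => true  (case-insensitive)
and_token.match?('&')      # => true
and_token.match?('BRAND')  # => false: contains 'AND', but not as the whole token
```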

default_op[RW]

The default operator when none is otherwise given between parsed terms. Default: :and

infix_token[RW]

Pattern used in lexing to treat certain punctuation characters as separate tokens, even if they are not space-separated. Default: Pattern matching any of the characters '(', ')', '|', '&', '"', as used as operator/parenthesis tokens in the defaults below.

lparen[RW]

Left parenthesis pattern or value. Default: '('

lquote[RW]

Left quote pattern or value. Default: '"'

not_token[RW]

NOT operator token pattern. Should match the entire token, using the '\A' and '\z' anchors for beginning and end of string. Default: Pattern matching complete tokens 'NOT', 'not', or '-'

or_token[RW]

OR operator token pattern. Should match the entire token, using the '\A' and '\z' anchors for beginning and end of string. Default: Pattern matching complete tokens 'OR', 'or', or '|'

precedence[RW]

Hash of operators to precedence Integer values. The hash should also provide a default value for unlisted operators like any supported scopes. To limit human surprise, the default_op should have the lowest precedence. The default is as per DEFAULT_PRECEDENCE with a default value of 10, thus :not has the highest precedence at 11.
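A Hash constructed with a default value covers unlisted operators such as scopes. A minimal sketch consistent with the description above (the :and/:or values are illustrative; only the default of 10 and :not at 11 are stated):

```ruby
precedence = Hash.new(10)  # default for unlisted ops, e.g. scopes
precedence[:not] = 11      # highest precedence, per DEFAULT_PRECEDENCE
precedence[:or]  = 2       # illustrative
precedence[:and] = 1       # illustrative: default_op given the lowest precedence

precedence['TITLE']  # => 10, an unlisted scope falls back to the default
```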

prefix_token[RW]

Pattern used in lexing to treat certain characters as separate tokens when appearing as a prefix only. Default: '-' (as used in the default not_token)

rparen[RW]

Right parenthesis pattern or value. Default: ')'

rquote[RW]

Right quote pattern or value. It's fine if this is the same as lquote. Default: '"'

scope[RW]

Scope pattern or value matching post-normalized scope token, including trailing ':' but without whitespace. Default: nil -> no scopes

scope_token[RW]

SCOPE unary operator pattern used in lexing to treat a scope prefix, e.g. 'SCOPE' + ':', with or without internal or trailing whitespace, as a single token. Used by norm_scope, which also treats a non-matching ':' as whitespace. This would normally be set via scopes=. Default: nil -> no scopes

scope_upcase[RW]

Should scope tokens be upcased in the AST? This implies case-insensitive scopes, with scope_token as generated via scopes= with the `ignorecase: true` option. Default: false

spaces[RW]

Pattern matching one or more characters to treat as white-space. Default: SPACES

verbose[RW]

If true, log parsing progress and state to $stderr. Default: false

Public Class Methods

new(opts = {})

Construct given options which are interpreted as attribute names to set.

# File lib/human-ql/query_parser.rb, line 199
def initialize(opts = {})
  @default_op = :and

  @precedence = Hash.new(10)
  @precedence.merge!(DEFAULT_PRECEDENCE)
  @precedence.freeze

  @spaces = SPACES
  @infix_token  = /[()|&"]/.freeze
  @prefix_token = /(?<=\A|#{SP})-(?=#{NSP})/.freeze
  @or_token  = /\A(OR|\|)\z/i.freeze
  @and_token = /\A(AND|\&)\z/i.freeze
  @not_token = /\A(NOT|\-)\z/i.freeze
  @lquote = @rquote = '"'.freeze
  @lparen = '('.freeze
  @rparen = ')'.freeze

  @scope = nil
  @scope_token = nil
  @scope_upcase = false

  @verbose = false

  opts.each do |name,val|
    send(name.to_s + '=', val)
  end
end

Public Instance Methods

log(l = nil) { || ... }
# File lib/human-ql/query_parser.rb, line 240
def log(l = nil)
  if @verbose
    l = yield if block_given?
    $stderr.puts(l)
  end
end
norm_infix(q)

Treat various punctuation-form operators as always being separate tokens, per the infix_token pattern. Note: norm_space must always be called after this.

# File lib/human-ql/query_parser.rb, line 314
def norm_infix(q)
  q.gsub(@infix_token, ' \0 ')
end
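Using the default infix_token pattern from the constructor above, the punctuation operators are padded with spaces so they lex as separate tokens; norm_space later collapses the extra whitespace:

```ruby
infix_token = /[()|&"]/  # default from the constructor

# '\0' in the replacement refers back to the whole match.
'fox|dog(quick)'.gsub(infix_token, ' \0 ')
# => "fox | dog ( quick ) "
```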
norm_phrase_tokens(tokens)

Select which tokens survive in a phrase. Also passes each token through norm_term. Tokens matching lparen and rparen are dropped.

# File lib/human-ql/query_parser.rb, line 365
def norm_phrase_tokens(tokens)
  tokens.
    reject { |t| @lparen === t || @rparen === t }.
    map { |t| norm_term(t) }
end
norm_prefix(q)

Split prefixes as separate tokens per prefix_token pattern.

# File lib/human-ql/query_parser.rb, line 319
def norm_prefix(q)
  if @prefix_token
    q.gsub(@prefix_token, '\0 ')
  else
    q
  end
end
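Assuming the SP and NSP constants expand to Unicode space and non-space character classes (e.g. '[[:space:]]' and '[^[:space:]]'), the default prefix pattern splits a leading '-' off a term while leaving interior hyphens alone:

```ruby
# Assumed expansion of /(?<=\A|#{SP})-(?=#{NSP})/ from the constructor.
prefix_token = /(?<=\A|[[:space:]])-(?=[^[:space:]])/

'-quick brown-fox -dog'.gsub(prefix_token, '\0 ')
# => "- quick brown-fox - dog"   (interior hyphen untouched)
```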
norm_scope(q)

If scope_token is specified, normalize scopes as separate 'SCOPE:' tokens. This expects the second capture group of scope_token to be the actual matching scope name, if present.

# File lib/human-ql/query_parser.rb, line 331
def norm_scope(q)
  if @scope_token
    q.gsub(@scope_token) do
      if $2
        $2 + ': '
      else
        ' '
      end
    end
  else
    q
  end
end
norm_space(q)

Normalize any whitespace to a single ASCII SPACE character and strip leading/trailing whitespace.

# File lib/human-ql/query_parser.rb, line 347
def norm_space(q)
  q.gsub(@spaces, ' ').strip
end
norm_term(t)

A no-op in this implementation, but may be overridden to replace characters. Should neither receive nor return nil or empty values.

# File lib/human-ql/query_parser.rb, line 373
def norm_term(t)
  t
end
normalize(q)

Runs the suite of initial input norm_* functions. Returns nil if the result is empty.

# File lib/human-ql/query_parser.rb, line 353
def normalize(q)
  q ||= ''
  q = norm_infix(q)
  q = norm_scope(q)
  q = norm_prefix(q)
  q = norm_space(q)
  q unless q.empty?
end
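The pipeline can be traced on a sample input using the default patterns (a standalone sketch, not calling the library; norm_scope is skipped since scope_token defaults to nil, and SP/NSP are assumed to be '[[:space:]]' / '[^[:space:]]'):

```ruby
q = ' Fox|  "lazy dog"  -cat '
q = q.gsub(/[()|&"]/, ' \0 ')                              # norm_infix
q = q.gsub(/(?<=\A|[[:space:]])-(?=[^[:space:]])/, '\0 ')  # norm_prefix
q = q.gsub(/[[:space:]]+/, ' ').strip                      # norm_space
q  # => 'Fox | " lazy dog " - cat'
```

After normalization, parse can simply split on single spaces to obtain tokens.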
parse(q)
# File lib/human-ql/query_parser.rb, line 227
def parse(q)
  unless @default_op == :and || @default_op == :or
    raise("QueryParser#default_op is (#{@default_op.inspect}) " +
           "(should be :and or :or)")
  end
  q = normalize(q)
  tokens = q ? q.split(' ') : []
  log { "Parse: " + tokens.join(' ') }
  ast = parse_tree(tokens)
  log { "AST: " + ast.inspect }
  ast
end
parse_tree(tokens)
# File lib/human-ql/query_parser.rb, line 247
def parse_tree(tokens)
  s = ParseState.new(self)
  while (t = tokens.shift)
    case t
    when @lquote
      rqi = tokens.index { |tt| @rquote === tt }
      if rqi
        s.push_term([ :phrase, *norm_phrase_tokens(tokens[0...rqi]) ])
        tokens = tokens[rqi+1..-1]
      end # else ignore
    when @lparen
      rpi = rparen_index(tokens)
      if rpi
        s.push_term(parse_tree(tokens[0...rpi]))
        tokens = tokens[rpi+1..-1]
      end # else ignore
    when @rquote
    #ignore
    when @rparen
    #ignore
    when @scope
      s.push_op(scope_op(t))
    when @or_token
      s.push_op(:or)
    when @and_token
      s.push_op(:and)
    when @not_token
      s.push_op(:not)
    else
      s.push_term(norm_term(t))
    end
  end
  s.flush_tree
end
rparen_index(tokens)

Find the index of the token matching rparen among the remaining tokens, accounting for nested parentheses and skipping quoted phrases.

# File lib/human-ql/query_parser.rb, line 291
def rparen_index(tokens)
  li = 1
  phrase = false
  tokens.index do |tt|
    if phrase
      phrase = false if @rquote === tt
    else
      case tt
      when @rparen
        li -= 1
      when @lparen
        li += 1
      when @lquote
        phrase = true
      end
    end
    (li == 0)
  end
end
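The depth-counting idea in isolation: a counter starts at 1 for the already-consumed '(', increments on each nested lparen, decrements on each rparen, and the matching index is where it returns to zero (a standalone sketch that omits the phrase-skipping of the library version):

```ruby
# Find the index of the ')' matching an already-consumed '(',
# skipping over any nested parentheses.
def matching_rparen(tokens)
  depth = 1
  tokens.index do |t|
    depth += 1 if t == '('
    depth -= 1 if t == ')'
    depth.zero?
  end
end

matching_rparen(%w[a ( b ) c ) d])  # => 5
```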
scope_op(token)

Given a scope token, return the scope name (minus the trailing ':'), upcased if scope_upcase is set.

# File lib/human-ql/query_parser.rb, line 284
def scope_op(token)
  t = token[0...-1]
  t.upcase! if @scope_upcase
  t
end
scopes=(scopes)

Given one or an Array of scope prefixes, generate the scope and scope_token patterns. A trailing Hash is interpreted as options; see below.

Options

:ignorecase

If true, generate case-insensitive regexps and upcase the scope in the AST output (per scope_upcase)

# File lib/human-ql/query_parser.rb, line 152
def scopes=(scopes)
  scopes = Array(scopes)
  opts = scopes.last.is_a?(Hash) && scopes.pop || {}
  ignorecase = !!(opts[:ignorecase])
  if scopes.empty?
    @scope = nil
    @scope_token = nil
  elsif scopes.length == 1 && !ignorecase
    s = scopes.first
    @scope = (s + ':').freeze
    @scope_token = /((?<=\A|#{SP})(#{s}))?#{SP}*:/.freeze
  else
    opts = ignorecase ? Regexp::IGNORECASE : nil
    s = Regexp.union(*scopes).source
    @scope = Regexp.new('\A(' + s + '):\z', opts).freeze
    @scope_token = Regexp.new("((?<=\\A|#{SP})(#{s}))?#{SP}*:",
                               opts).freeze
  end
  @scope_upcase = ignorecase
  nil
end
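The scope value pattern from the multi-scope branch above can be reproduced standalone. For scopes ['title', 'author'] with ignorecase: true:

```ruby
scopes = %w[title author]
src    = Regexp.union(*scopes).source  # => "title|author"
scope  = Regexp.new('\A(' + src + '):\z', Regexp::IGNORECASE)

scope.match?('TITLE:')   # => true  (case-insensitive)
scope.match?('title:x')  # => false: anchored to the whole token
```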