Google Search Syntax BNF
posted by brian at 03:39 AM
A while ago I was making a specialized search engine of sorts and I wanted to use Google's nice query syntax with my own custom modifiers instead of things like inurl:, intitle:, site:, etc. Besides these powerful modifiers, Google's search syntax is nice because no query is invalid. Yahoo, MSN, and Amazon also at least support more than just basic search expressions.
Sometimes I wish other sites (like reddit) would implement this syntax for their search queries. So tomorrow I'll release a little Python module that parses this query syntax and makes the query easy to read and process. I wrote code that did this successfully a year or two ago, but I'd like to rewrite it now with pyparsing or something. Now there will be no excuse to offer only lame search queries.
To get the ball rolling, here's the BNF for the syntax I will implement. I have no idea if this is even a proper way to express BNF, but I used the Python grammar as a reference (because if it's good enough for Guido, it's good enough for me).
L ::= expr | expr L
expr ::= term | binary_expr
binary_expr ::= term " " binary_op " " term
binary_op ::= "*" | "OR" | "AND"
include_bool ::= "+" | "-"
term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~" literal)
modifier ::= (letter | "_")+
literal ::= word | quoted_words
quoted_words ::= '"' word (" " word)* '"'
word ::= (letter | digit | "_")+
number ::= digit+
range ::= number (".." | "...") number
letter ::= "A"..."Z" | "a"..."z"
digit ::= "0"..."9"
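To make the `term` rule concrete, here is a minimal sketch of it using only Python's standard `re` module. This is not the module promised above and not pyparsing; the names `Term`, `TOKEN_RE`, and `parse_terms` are made up for illustration. It deliberately ignores `binary_expr`, so OR and AND come out as ordinary words, and it silently skips any character that fits no rule, on the assumption that this is a reasonable way to honor "no query is invalid".

```python
import re
from typing import List, NamedTuple, Optional

# Building blocks taken from the BNF: word, quoted_words, range.
WORD = r"[A-Za-z0-9_]+"
QUOTED = rf'"{WORD}(?:[ ]{WORD})*"'
RANGE = r"\d+\.{2,3}\d+"

# One regex for the whole term rule (hypothetical name, verbose mode).
TOKEN_RE = re.compile(
    rf"""
    ~(?P<synonym>{QUOTED}|{WORD})        # tilde followed by a literal
    |
    (?P<include>[+-])?                   # optional include_bool
    (?:(?P<modifier>[A-Za-z_]+):)?       # optional modifier and colon
    (?:(?P<range>{RANGE})|(?P<phrase>{QUOTED})|(?P<word>{WORD}))
    """,
    re.VERBOSE,
)


class Term(NamedTuple):
    kind: str                      # "word", "phrase", "range" or "synonym"
    value: str                     # text of the term, quotes removed
    include: Optional[str] = None  # "+", "-" or None
    modifier: Optional[str] = None


def parse_terms(query: str) -> List[Term]:
    """Return every term found in query, skipping characters that match nothing."""
    terms = []
    for m in TOKEN_RE.finditer(query):
        if m.group("synonym") is not None:
            terms.append(Term("synonym", m.group("synonym").strip('"')))
            continue
        # Exactly one of the three value groups matched; find out which.
        for kind in ("range", "phrase", "word"):
            if m.group(kind) is not None:
                value = m.group(kind).strip('"')
                terms.append(Term(kind, value, m.group("include"), m.group("modifier")))
                break
    return terms


if __name__ == "__main__":
    for term in parse_terms('+python -intitle:snake "monty python" year:1969..1974 ~parrot'):
        print(term)
```

The range alternative is tried before the plain word so that `1969..1974` is not split into two numbers. A real implementation would still need to decide how OR, AND, and `*` bind terms together, which this sketch leaves out.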
Contrary to what many people believe, you can NOT use parentheses for grouping or precedence in Google search queries. Every punctuation character except +, -, _, ", and . is converted to a space and becomes meaningless.
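The punctuation rule can be sketched as a preprocessing step. The snippet below takes the list of surviving characters literally; `KEPT` and `normalize` are hypothetical names, and collapsing runs of spaces afterwards is my own assumption rather than documented Google behavior. Note that the grammar above also uses ":", "~", and "*", so a real parser would probably have to add those to the kept set.

```python
import string

# Punctuation that survives, taken literally from the list in the post.
KEPT = '+-_".'


def normalize(query: str, kept: str = KEPT) -> str:
    """Turn every ASCII punctuation character not in kept into a space."""
    table = str.maketrans({c: " " for c in string.punctuation if c not in kept})
    # Assumption: runs of whitespace carry no meaning, so collapse them.
    return " ".join(query.translate(table).split())


if __name__ == "__main__":
    print(normalize("(ham OR eggs) & toast!"))
```

With this, the parentheses in `(ham OR eggs) & toast!` simply vanish, which is exactly why they cannot be used for grouping.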
Comments
If you are interested, there are two new examples in the next release of pyparsing. One is a search query string parser, the other parses Python's EBNF grammar. Let me know if you want an early copy of either.
-- Paul
Great, I will be interested!
Is the query syntax that will be included the Lucene syntax? Google's is very close but is not as fully-featured, which is what I want actually, but the standard Lucene syntax is fine too. I'm also not sure how Lucene takes care of all the corner cases when strange characters are inserted.
If you are interested, I have used pyparsing to make a Google-style parser for the search interface to my database. It isn't _exactly_ the Google syntax, but good enough for me to use for now.
Will send it to you via mail.
(I also did a lightning talk on this at the Dutch Python Usergroup last week.)
Where did you find references for Google's grammar? I'm working on a similar implementation, and would like to have a definition of the Google grammar to work from (and show to my boss for reference).
Thanks!
Google's grammar isn't all that simple anymore. At some point since I started using it, Google started stemming words. The stemming in Google Groups includes adding alternate forms using non-English vowels. However, I often use a hyphen as an infix phrase operator. It will match the two words combined into one word, the two words as a phrase, or the two words hyphenated. Compare the ham eggs search with the ham-eggs search and with the "ham eggs" search. Try the same search with Google Desktop. Also compare the different ways the underbar (_) is parsed in Google Desktop vs. Google web search.