Parsing a search string

Freddie · Dec 31, 2004

Happy new year! Since I have run out of alcohol, I'll ask a question that I
haven't really worked out an answer for yet. Is there an elegant way to turn
something like:

> moo cow "farmer john" -zug

into:

['moo', 'cow', 'farmer john'], ['zug']

I'm trying to parse a search string so I can use it for SQL WHERE constraints,
preferably without horrifying regular expressions. Uhh yeah.

From 2005,
Freddie

M.E.Farmer · Dec 31, 2004

How ,
I just posted on something similar earlier.

OK, first of all, you might want to try shlex; it is in the standard
library.
If you don't know what cStringIO is, don't worry about it; it is just there
to give shlex a file-like object.
If you have a file, just pass it in opened.
Example: a = shlex.shlex(open('mytxt.txt','r'))

py>import shlex
py>import cStringIO
py>d = cStringIO.StringIO()
py>d.write('moo cow "farmer john" -zug')
py>d.seek(0)
py>a = shlex.shlex(d)
py>a.get_token()
'moo'
py>a.get_token()
'cow'
py>a.get_token()
'"farmer john"'
py>a.get_token()
'-'
py>a.get_token()
'zug'
py>a.get_token()
''
# OK, we try again; this time we add - to the valid chars so that it gets
# grouped into a single token.
py>d.seek(0)
py>a = shlex.shlex(d)
py>a.wordchars += '-' # add the hyphen
py>a.get_token()
'moo'
py>a.get_token()
'cow'
py>a.get_token()
'"farmer john"'
py>a.get_token()
'-zug'
py>a.get_token()
''

Hth,
M.E.Farmer

Fuzzyman · Dec 31, 2004

That's not bad going considering you've only run out of alcohol at 6 in
the morning and *then* ask python questions.

Anyway - you could write a character-by-character parser function that
would do that in a few minutes...

My 'listquote' module has one - but it splits on commas not whitespace.
Sounds like you're looking for a one-liner though.... regular
expressions *could* do it...............

Regards,

Fuzzy
http://www.voidspace.org.uk/atlantibots/pythonutils.html#llistquote

Reinhold Birkenfeld · Dec 31, 2004

Freddie said:
Happy new year! Since I have run out of alcohol, I'll ask a question that I
haven't really worked out an answer for yet. Is there an elegant way to turn
something like:

moo cow "farmer john" -zug

into:

['moo', 'cow', 'farmer john'], ['zug']

I'm trying to parse a search string so I can use it for SQL WHERE constraints,
preferably without horrifying regular expressions. Uhh yeah.

The shlex approach, finished:

import shlex

searchstring = 'moo cow "farmer john" -zug'
lexer = shlex.shlex(searchstring)
lexer.wordchars += '-'
poslist, neglist = [], []
while 1:
    token = lexer.get_token()
    # token is '' on eof
    if not token: break
    # remove quotes
    if token[0] in '"\'':
        token = token[1:-1]
    # select in which list to put it
    if token[0] == '-':
        neglist.append(token[1:])
    else:
        poslist.append(token)

regards,
Reinhold

M.E.Farmer · Dec 31, 2004

As I noted before, shlex requires a file-like object or an open file.
py> import shlex
py> a = shlex.shlex('fgfgfg dgfgfdgfdg')
py> a.get_token()
Traceback (most recent call last):
  File "<input>", line 1, in ?
  File ".\shlex.py", line 74, in get_token
    raw = self.read_token()
  File ".\shlex.py", line 100, in read_token
    nextchar = self.instream.read(1)
AttributeError: 'str' object has no attribute 'read'

M.E.Farmer

Reinhold Birkenfeld · Dec 31, 2004

M.E.Farmer said:
As I noted before shlex requires a file like object or a open file .
py> import shlex
py> a = shlex.shlex('fgfgfg dgfgfdgfdg')
py> a.get_token()
Traceback (most recent call last):
  File "<input>", line 1, in ?
  File ".\shlex.py", line 74, in get_token
    raw = self.read_token()
  File ".\shlex.py", line 100, in read_token
    nextchar = self.instream.read(1)
AttributeError: 'str' object has no attribute 'read'

Which Python version are you using?

The docs say that since Py2.3 strings are accepted.

regards,
Reinhold
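
For anyone reading this on a current interpreter, here is a small sketch of the point above. It assumes Python 3, where cStringIO no longer exists and io.StringIO stands in for it; the variable names are only for illustration. Both ways of feeding shlex should give the same tokens:

```python
import io
import shlex

text = 'moo cow "farmer john" -zug'

# Pass the string directly (accepted since Python 2.3, per the docs).
from_string = shlex.shlex(text)
from_string.wordchars += '-'

# Pass a file-like object, which is what older versions required.
from_file = shlex.shlex(io.StringIO(text))
from_file.wordchars += '-'

# A shlex object is iterable; iteration stops at the end-of-input token.
tokens_a = list(from_string)
tokens_b = list(from_file)
print(tokens_a)
print(tokens_b)
```

Note that in this default (non-POSIX) mode the quoted phrase keeps its quote characters, as in the sessions above.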

It's me · Dec 31, 2004

I am right in the middle of doing text parsing so I used your example as a
mental exercise.

Here's a NDFA for your text:

      b    0    1-9  a-Z  ,    .    +    -    '    "    \n
S0:   S0   E    E    S1   E    E    E    S3   E    S2   E
S1:   T1   E    E    S1   E    E    E    E    E    E    T1
S2:   S2   E    E    S2   E    E    E    E    E    T2   E
S3:   T3   E    E    S3   E    E    E    E    E    E    T3

and the end-states are:

E: error in text
T1: You have the words: moo, cow
T2: You get "farmer john" (w quotes)
T3: You get zug

Can't guarantee that I did it right - I did it really quickly - and it's
*specific* to your text string.

Now just need to hire a programmer to write some clean Python parsing code.

--
It's me
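
One way the table above could be turned into code, as a rough sketch rather than anything the poster wrote. It assumes Python 3; the names char_class, TABLE and tokenize are made up for illustration. Two liberties are taken and marked in the comments: the quotes and the leading '-' are stripped so the result matches what Freddie asked for, and running out of input between tokens is treated as a normal end rather than as state E.

```python
import string

COLUMNS = ['b', '0', '1-9', 'a-Z', ',', '.', '+', '-', "'", '"', '\n']
ROWS = {
    'S0': 'S0 E E S1 E E E S3 E S2 E',
    'S1': 'T1 E E S1 E E E E E E T1',
    'S2': 'S2 E E S2 E E E E E T2 E',
    'S3': 'T3 E E S3 E E E E E E T3',
}
# TABLE[state][column] -> next state
TABLE = {s: dict(zip(COLUMNS, row.split())) for s, row in ROWS.items()}


def char_class(ch):
    """Map a character to its column heading ('b' means blank)."""
    if ch == ' ':
        return 'b'
    if ch in '123456789':
        return '1-9'
    if ch in string.ascii_letters:
        return 'a-Z'
    return ch  # '0' and punctuation are their own column; others miss the table


def tokenize(text):
    """Return (positive, negative) term lists; raise ValueError on state E."""
    positive, negative = [], []
    state, start = 'S0', 0
    for i, ch in enumerate(text + '\n'):  # the newline flushes the last token
        if state == 'S0':
            if i == len(text):
                break      # assumption: end of input between tokens is fine
            start = i      # a token can only begin while in S0
        state = TABLE[state].get(char_class(ch), 'E')
        if state == 'E':
            raise ValueError('unexpected %r at position %d' % (ch, i))
        if state == 'T1':
            positive.append(text[start:i])
        elif state == 'T2':
            positive.append(text[start + 1:i])  # assumption: drop the quotes
        elif state == 'T3':
            negative.append(text[start + 1:i])  # assumption: drop the '-'
        if state.startswith('T'):
            state = 'S0'   # restart for the next token
    return positive, negative


print(tokenize('moo cow "farmer john" -zug'))
```

Because the table is specific to the sample string, anything it does not list (digits at the start of a word, for instance) ends in a ValueError here.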

M.E.Farmer · Dec 31, 2004

Ah! that is what the __future__ brings I guess.........
Damn that progress making me outdated

Python 2.2.3 (a lot of extensions I use are stuck there, so I still
use it)
M.E.Farmer

Reinhold Birkenfeld · Dec 31, 2004

M.E.Farmer said:
Ah! that is what the __future__ brings I guess.........
Damn that progress making me outdated
Python 2.2.3 ( a lot of extensions I use are stuck there , so I still
use it)

I'm also pleasantly surprised by how many cute little additions there are
in every new Python version. Many thanks to the great devs!

Reinhold

Andrew Dalke · Dec 31, 2004

It's me said:
Here's a NDFA for your text:

b 0 1-9 a-Z , . + - ' " \n
S0: S0 E E S1 E E E S3 E S2 E
S1: T1 E E S1 E E E E E E T1
S2: S2 E E S2 E E E E E T2 E
S3: T3 E E S3 E E E E E E T3

Now if I only had an NDFA for parsing that syntax...

Andrew

It's me · Dec 31, 2004

Andrew Dalke said:
Now if I only had an NDFA for parsing that syntax...

Just finished one (don't ask me to show it - very clumsy Python code -
still in learning mode).

Here's one for parsing an integer:

#       b    0    1-9  ,    .    +    -    '    "    a-Z  \n
# S0:   S0   S0   S1   T0   E    S2   S2   E    E    E    T0
# S1:   S3   S1   S1   T1   E    E    E    E    E    E    T1
# S2:   E    S2   S1   E    E    E    E    E    E    E    E
# S3:   S3   T2   T2   T1   T2   T2   T2   T2   T2   T2   T1

T0: you got a null token
T1: you got a good token, separator was ","
T2: you got a good token b, separator was " "
E: bad token

Brian Beck · Dec 31, 2004

Freddie said:
I'm trying to parse a search string so I can use it for SQL WHERE
constraints, preferably without horrifying regular expressions. Uhh yeah.

If you're interested, I've written a function that parses query strings
using a customizable version of Google's search syntax.

Features include:
- Binary operators like OR
- Unary operators like '-' for exclusion
- Customizable modifiers like Google's site:, intitle:, inurl: syntax
- *No* query is an error (invalid characters are fixed up, etc.)
- Result is a dictionary in one of two possible forms, both geared
towards being input to a search method for your database

I'd be glad to post the code, although I'd probably want to have a last
look at it before I let others see it...

John Machin · Dec 31, 2004

Andrew said:
Now if I only had an NDFA for parsing that syntax...

Parsing your sentence as written ("if I only had"): If you were the
sole keeper of the secret??

Parsing it as intended ("if only I had"), and ignoring the smiley:
Looks like a fairly straight-forward state-transition table to me. The
column headings are not aligned properly in the message, b means blank,
a-Z is bletchworthy, but the da Vinci code it ain't.

If only we had an NDFA (whatever that is) for guessing what acronyms
mean ...

Where I come from:
DFA = deterministic finite-state automaton
NFA = non-det......
SFA = content-free
NFI = concept-free
NDFA = National Dairy Farmers' Association

HTH, and Happy New Year!

It's me · Dec 31, 2004

John Machin said:
Parsing your sentence as written ("if I only had"): If you were the
sole keeper of the secret??

Parsing it as intended ("if only I had"), and ignoring the smiley:
Looks like a fairly straight-forward state-transition table to me.

Exactly.

The column headings are not aligned properly in the message, b means blank,
a-Z is bletchworthy, but the da Vinci code it ain't.

If only we had an NDFA (whatever that is) for guessing what acronyms
mean ...

I believe (I am not a computer science major):

NDFA = non-deterministic finite automata

and:

S: state
T: terminal
E: error

So S1 means state #1, T1 means terminal #1, and so forth.

You are correct that parsing that table is not hard.

a) Set up a stack, place the buffer onto the stack, and start in S0
b) For each character that comes off the stack, look up the next state
for that character
c) If it's not a T or E state, jump to that state
d) If it's a T or E state, finish
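
Steps a) to d) could be sketched like this, using the integer table posted earlier in the thread. This is only one reading of the steps, assuming Python 3; column_for and run are hypothetical names, and returning the current S state when the buffer runs out is a choice made here, not something the steps specify.

```python
import string

COLUMNS = ['b', '0', '1-9', ',', '.', '+', '-', "'", '"', 'a-Z', '\n']
ROWS = {
    'S0': 'S0 S0 S1 T0 E S2 S2 E E E T0',
    'S1': 'S3 S1 S1 T1 E E E E E E T1',
    'S2': 'E S2 S1 E E E E E E E E',
    'S3': 'S3 T2 T2 T1 T2 T2 T2 T2 T2 T2 T1',
}
# TABLE[state][column] -> next state
TABLE = {s: dict(zip(COLUMNS, row.split())) for s, row in ROWS.items()}


def column_for(ch):
    """Map a character to its column heading ('b' means blank)."""
    if ch == ' ':
        return 'b'
    if ch in '123456789':
        return '1-9'
    if ch in string.ascii_letters:
        return 'a-Z'
    return ch  # '0' and punctuation are their own column


def run(buffer):
    """Walk the table over buffer and return the state where it stops."""
    stack = list(reversed(buffer))  # (a) first character ends up on top
    state = 'S0'
    while stack:
        ch = stack.pop()
        # (b) look up the next state; unknown characters count as E
        state = TABLE[state].get(column_for(ch), 'E')
        if state[0] in 'TE':        # (d) terminal or error: finish
            return state
        # (c) otherwise we are now in the new S state; keep going
    return state  # assumption: buffer exhausted before any T or E state


for sample in ['  42,', '-7\n', '12 34', '4.2', ',']:
    print(repr(sample), '->', run(sample))
```

The function only reports which state the walk stopped in; collecting the digits of the token would need a little extra bookkeeping along the way.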

Freddie · Jan 1, 2005

Reinhold said:
Freddie said:

Happy new year! Since I have run out of alcohol, I'll ask a question that I
haven't really worked out an answer for yet. Is there an elegant way to turn
something like:

moo cow "farmer john" -zug

into:

['moo', 'cow', 'farmer john'], ['zug']

I'm trying to parse a search string so I can use it for SQL WHERE constraints,
preferably without horrifying regular expressions. Uhh yeah.

The shlex approach, finished:

import shlex

searchstring = 'moo cow "farmer john" -zug'
lexer = shlex.shlex(searchstring)
lexer.wordchars += '-'
poslist, neglist = [], []
while 1:
    token = lexer.get_token()
    # token is '' on eof
    if not token: break
    # remove quotes
    if token[0] in '"\'':
        token = token[1:-1]
    # select in which list to put it
    if token[0] == '-':
        neglist.append(token[1:])
    else:
        poslist.append(token)

regards,
Reinhold

Thanks for this, though there was one issue:
....     tok = lexer.get_token()
....     if not tok: break
....     print tok
....
moo
cow
+"farmer
john"
-dog

The '+"farmer john"' part would be turned into two separate words, '+"farmer'
and 'john"'. I ended up using shlex.split() (which the docs say is new in
Python 2.3), which gives me the desired result. Thanks for the help from
yourself and M.E.Farmer.

Freddie

>>> shlex.split('moo cow +"farmer john" -"evil dog"')
['moo', 'cow', '+farmer john', '-evil dog']
>>> shlex.split('moo cow +"farmer john" -"evil dog" +elephant')
['moo', 'cow', '+farmer john', '-evil dog', '+elephant']
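
Going from that shlex.split() output to the two lists in the original question, and on to a WHERE fragment, might look roughly like the sketch below. It assumes Python 3. The function names, the column name 'body', and the choice of "?" placeholders (the style sqlite3 uses) are all assumptions for illustration, as is treating a leading '+' the same as no prefix.

```python
import shlex


def parse_search(query):
    """Split a query into (include, exclude) term lists via shlex.split()."""
    include, exclude = [], []
    for token in shlex.split(query):
        if token.startswith('-'):
            target, term = exclude, token[1:]
        elif token.startswith('+'):
            target, term = include, token[1:]
        else:
            target, term = include, token
        if term:                      # skip a bare '+' or '-'
            target.append(term)
    return include, exclude


def build_where(include, exclude, column='body'):
    """Build a parameterised WHERE fragment; 'body' is a hypothetical column."""
    clauses = ['%s LIKE ?' % column for _ in include]
    clauses += ['%s NOT LIKE ?' % column for _ in exclude]
    # The terms travel as parameters instead of being pasted into the SQL.
    params = ['%' + term + '%' for term in include + exclude]
    return ' AND '.join(clauses), params


include, exclude = parse_search('moo cow +"farmer john" -"evil dog" +elephant')
print(include, exclude)
print(build_where(include, exclude))
```

Two caveats: shlex.split() raises ValueError on an unbalanced quote, so user input probably wants a try/except around it, and any % or _ inside a term still acts as a LIKE wildcard here.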

M.E.Farmer · Jan 1, 2005

py>b = shlex.shlex(a)
py>while 1:
....     tok = b.get_token()
....     if not tok: break
....     print tok
....
moo
cow
+
"farmer john"
-
dog

Just wanted to share this in case it might be relevant.
It seems that if we don't add +- to wordchars, we get a different split
on "farmer john".
M.E.Farmer
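
The two behaviours can be put side by side. A small sketch, assuming Python 3 and the default non-POSIX shlex mode used throughout the thread; the helper name tokens is made up here:

```python
import shlex


def tokens(text, extra_wordchars=''):
    """Tokenise text with shlex, optionally widening wordchars."""
    lexer = shlex.shlex(text)          # non-POSIX mode, as in the thread
    lexer.wordchars += extra_wordchars
    return list(lexer)


query = 'moo cow +"farmer john" -dog'

plain = tokens(query)                  # '+' and '-' come out as separate tokens
extended = tokens(query, '+-')         # the quote is glued to '+', so the phrase splits
posix_split = shlex.split(query)       # POSIX-style splitting keeps the phrase whole

print(plain)
print(extended)
print(posix_split)
```

With '+-' in wordchars the opening quote becomes part of a word, which is why the phrase breaks at the space; shlex.split() avoids that because it handles quotes inside words.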
