
ply

PLY (Python Lex-Yacc)

How to write a regular expression to match a string literal where the escape is a doubling of the quote character?

I am writing a parser using PLY that needs to identify FORTRAN string literals. These are quoted with single quotes, and the escape for an embedded quote is a doubled single quote, i.e.

'I don''t understand what you mean'

is a valid escaped FORTRAN string.

PLY takes regular expressions as input. My attempt so far does not work, and I don't understand why.

t_STRING_LITERAL = r"'[^('')]*'"

Any ideas?
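
For comparison, a regex that does handle the doubled-quote escape might look like this minimal sketch (not from the original post). A character class matches exactly one character, so [^('')] cannot express the two-character escape; an alternation can:

import ply.lex as lex

tokens = ('STRING_LITERAL',)

# The body is any mix of non-quote characters and doubled quotes.
t_STRING_LITERAL = r"'([^']|'')*'"

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("'I don''t understand what you mean'")
print(lexer.token())  # one STRING_LITERAL spanning the whole input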


Source: (StackOverflow)

Lex strings with single, double, or triple quotes

My objective is to parse strings the way Python does.

Question: How to write a lex to support the following:

  1. "string..."
  2. 'string...'
  3. """multi line string \n \n end"""
  4. '''multi line string \n \n end'''

Some code:

states = (
        ('string', 'exclusive'),
        )

# Strings
def t_begin_string(self, t):
    r'(\'|(\'{3})|\"|(\"{3}))'
    t.lexer.push_state('string')

def t_string_end(self, t):
    r'(\'|(\'{3})|\"|(\"{3}))'
    t.lexer.pop_state()

def t_string_newline(self, t):
    r'\n'
    t.lexer.lineno += 1

def t_string_error(self, t):
    print("Illegal character in string '%s'" % t.value[0])
    t.lexer.skip(1)


My current idea is to create 4 unique states that will match the 4 different string cases, but I'm wondering if there's a better approach.
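
An alternative sketch (assuming plain module-level rules rather than the class-based lexer above): match each complete string with a single rule, listing the triple-quoted alternatives first so that """ is not read as an empty "" string:

import ply.lex as lex

tokens = ('STRING',)

def t_STRING(t):
    r'("""(.|\n)*?""")|(\'\'\'(.|\n)*?\'\'\')|("[^"\n]*")|(\'[^\'\n]*\')'
    # Triple-quoted forms come first; (.|\n)*? spans lines lazily.
    t.lexer.lineno += t.value.count('\n')  # keep line numbers accurate
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('"""multi line\nstring"""  \'single\'')
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)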

Thanks for your help!


Source: (StackOverflow)


How to get PLY to ignore case of a regular expression?

I'm working on a simple translator from SQL INSERT statements to a dataset XML file to be used with DbUnit.

My current definition looks like this:

def t_INSERT(token):
    r'INSERT\s+INTO'
    return token

Now, I want to support case-insensitive SQL commands, for example, to accept all of INSERT INTO, Insert Into, insert into and iNsErT inTO as the same thing.

I wonder if there is a way to make PLY use re.I so that it will ignore case, or some other way to write the rule that I'm not familiar with.
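
One approach I believe works is the reflags argument of lex.lex(). A minimal sketch; note that passing reflags replaces PLY's default flags, so re.VERBOSE (which PLY normally applies to token regexes) has to be restored explicitly:

import re
import ply.lex as lex

tokens = ('INSERT',)

def t_INSERT(token):
    r'INSERT\s+INTO'
    return token

t_ignore = ' \t'

def t_error(token):
    token.lexer.skip(1)

lexer = lex.lex(reflags=re.IGNORECASE | re.VERBOSE)

lexer.input('iNsErT inTO')
print(lexer.token())  # LexToken(INSERT,'iNsErT inTO',1,0)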


Source: (StackOverflow)

RegEx with variable data in it - ply.lex

I'm using the Python module ply.lex to write a lexer. I have some of my tokens specified with regular expressions, but now I'm stuck. I have a list of keywords that should become one kind of token. data is a list of about 1000 keywords, all of which should be recognised as one sort of keyword. They look like, for example, _Function1, _UDFType2 and so on. The words in the list are separated by whitespace, that's it. I just want the lexer to recognise the words in this list, so that it returns a token of type `KEYWORD`.

data = 'Keyword1 Keyword2 Keyword3 Keyword4'
def t_KEYWORD(t):
    # ... r'\$' + data ??
    return t

text = '''
Some test data


even more

$var = 2231




$[]Test this 2.31 + / &
'''

autoit = lex.lex()
autoit.input(text)
while True:
    tok = autoit.token()
    if not tok: break
    print(tok)

So I was trying to add the variable to that regex, but it didn't work. I always get: No regular expression defined for rule 't_KEYWORD'.
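
If I understand the goal, the obstacle is that a function rule's regex must be a literal docstring, so it cannot be built from a variable there. PLY's @TOKEN decorator exists for exactly this; a sketch with the sample keywords (prepend r'\$' or similar if your keywords really carry a prefix):

import re
import ply.lex as lex
from ply.lex import TOKEN

tokens = ('KEYWORD',)

data = 'Keyword1 Keyword2 Keyword3 Keyword4'

# Longest first, so e.g. 'Keyword10' would not be cut off at 'Keyword1'.
words = sorted(data.split(), key=len, reverse=True)
keyword_pattern = '|'.join(re.escape(word) for word in words)

@TOKEN(keyword_pattern)
def t_KEYWORD(t):
    return t

t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('Keyword3 Keyword1')
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)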

Thank you in advance! John


Source: (StackOverflow)

How to discard non-terminal in grammar file with PLY (Python Lex-Yacc)

I have run into a problem using PLY. I want to create a call-graph generator with it. In some situations I need to discard tokens in the grammar file rather than in the lexer, because I need to do something when the parser recognizes the token before I discard it. For example, the IF token is one I want to discard, so I tried to discard it in the grammar file, like this:

def p_if(p):
    'if : IF'
    print "if"
    parser.symstack.pop()

But things didn't go the way I thought. I printed symstack (an attribute of parser, which is an LRParser instance from yacc.py), and the symstack list contained only the previous tokens, not 'if'. So I am wondering how to discard a token in this situation. Could anyone help me? Thanks a lot!
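
A sketch of the usual alternative (token and rule names are illustrative, not this grammar): instead of popping symstack, let a small marker rule consume IF, run the side effect in its action, and have the enclosing rule ignore its value:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('IF', 'NAME')

def t_IF(t):
    r'if\b'
    return t

t_NAME = r'[a-z_]+'
t_ignore = ' '

def t_error(t):
    t.lexer.skip(1)

def p_statement(p):
    'statement : if_marker NAME'
    p[0] = p[2]          # only the rest of the statement survives

def p_if_marker(p):
    'if_marker : IF'
    print("if")          # do the work here, then discard the token
    p[0] = None

def p_error(p):
    print("syntax error")

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('if foo', lexer=lexer))  # prints "if", then 'foo'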


Source: (StackOverflow)

How can I create a ply rule for recognizing CRs?

I have trouble with distinguishing between \r (0x0d) and \n (0x0a) in my PLY lexer.

A minimal example is the following program

import ply.lex as lex

# token names
tokens = ('CR', 'LF')

# token regexes
t_CR = r'\r'
t_LF = r'\n'

# chars to ignore
t_ignore  = 'abc \t'

# Build the lexer
lexer = lex.lex()

# lex
f = open('foo', 'r')
lexer.input(f.read())
while True:
    tok = lexer.token()
    if not tok: break
    print(tok)

Now creating a file foo as follows:

printf "a\r\n\r\rbc\r\n\n\r" > foo

Verifying that it looks ok:

hd foo
00000000  61 0d 0a 0d 0d 62 63 0d  0a 0a 0d                 |a....bc....|
0000000b

Now I had assumed that I would get some CR and some LF tokens, but:

python3 crlf.py 
WARNING: No t_error rule is defined
LexToken(LF,'\n',1,1)
LexToken(LF,'\n',1,2)
LexToken(LF,'\n',1,3)
LexToken(LF,'\n',1,6)
LexToken(LF,'\n',1,7)
LexToken(LF,'\n',1,8)

it turns out I only get LF tokens. I would like to know why this happens, and how I should do it instead.

This is Python 3.2.3 on Ubuntu 12.04
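
A hedged guess at the cause, with the one-line change it implies: opening the file in text mode applies universal-newline translation, which maps \r and \r\n to \n before the lexer ever sees them. In Python 3, newline='' turns the translation off, so this would replace the open() call above:

# newline='' disables universal-newline translation, so '\r' survives.
with open('foo', 'r', newline='') as f:
    lexer.input(f.read())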


Source: (StackOverflow)

Tokenizing left over data with lex/yacc

Forgive me, I'm completely new to parsing and lex/yacc, and I'm probably in way over my head, but nonetheless:

I'm writing a pretty basic calculator with PLY, but its input might not always be an equation, and I need to determine whether it is one when parsing. At one extreme, the input evaluates perfectly as an equation, which parses fine and gets calculated; at the other, it is nothing like an equation, fails parsing, and that is also fine.

The gray area is input that has equation-like parts, which the parser will grab and work out. This isn't what I want - I need to be able to tell whether parts of the string didn't get picked up and tokenized, so I can throw back an error, but I have no idea how to do this.

Does anyone know how I can define, basically, a 'catch anything that's left' token? Or is there a better way I can handle this?
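
One pattern that can be sketched (all names illustrative; a toy adder stands in for the calculator): have both t_error and p_error set a flag, then reject the whole input after parsing if the flag was raised, so equation-like fragments surrounded by junk do not pass:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

failed = False

def t_error(t):
    global failed
    failed = True        # unlexable character: not a clean equation
    t.lexer.skip(1)

def p_expr_plus(p):
    'expr : expr PLUS NUMBER'
    p[0] = p[1] + p[3]

def p_expr_number(p):
    'expr : NUMBER'
    p[0] = p[1]

def p_error(p):
    global failed
    failed = True        # leftover or unexpected token

lexer = lex.lex()
parser = yacc.yacc()

def evaluate(text):
    global failed
    failed = False
    result = parser.parse(text, lexer=lexer)
    return None if failed else result

print(evaluate('1 + 2'))       # 3
print(evaluate('1 + 2 junk'))  # None: leftover text was detected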


Source: (StackOverflow)

ply lexmatch regular expression has different groups than a usual re

I am using PLY and have noticed a strange discrepancy between the token re match stored in t.lexer.lexmatch, as compared with an SRE pattern defined in the usual way with the re module. The group(x)'s seem to be off by one.

I have defined a simple lexer to illustrate the behavior I am seeing:

import ply.lex as lex

tokens = ('CHAR',)

def t_CHAR(t):
    r'.'
    t.value = t.lexer.lexmatch
    return t

l = lex.lex()

(I get a warning about t_error but ignore it for now.) Now I feed some input into the lexer and get a token:

l.input('hello')
l.token()

I get a LexToken(CHAR,<_sre.SRE_Match object at 0x100fb1eb8>,1,0). I want to look at the match object:

m = _.value

So now I look at the groups:

m.group() => 'h' as I expect.

m.group(0) => 'h' as I expect.

m.group(1) => 'h', yet I would expect it to not have such a group.

Compare this to creating such a regular expression manually:

import re
p = re.compile(r'.')
m2 = p.match('hello')

This gives different groups:

m2.group() => 'h' as I expect.

m2.group(0) => 'h' as I expect.

m2.group(1) => IndexError: no such group, as I expect.

Does anyone know why this discrepancy exists?
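
A sketch of the likely cause (an assumption about PLY's internals, not verified against any particular version): PLY combines all token rules into one master regex, wrapping each rule in its own named group, so a rule's whole match also appears as a numbered subgroup:

import re

p = re.compile(r'(?P<CHAR>.)')  # roughly what PLY generates for t_CHAR
m = p.match('hello')
print(m.group(0))  # 'h'
print(m.group(1))  # 'h' -- the "extra" group seen through lexmatch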


Source: (StackOverflow)

LALR grammar, trailing comma and multiline list assignment

I'm trying to produce an LALR grammar for a very simple language composed of assignments. For example:

foo = "bar"
bar = 42

The language should also handle list of values, for example:

foo = 1, 2, 3

But I also want to handle list on multiple lines:

foo = 1, 2
      3, 4

Trailing comma (for singletons and language flexibility):

foo = 1,
foo = 1, 2,

And obviously, both at the same time:

foo = 1,
      2,
      3,

I'm able to write a grammar with trailing comma or multi-line list, but not for both at the same time.

My grammar look like this:

content : content '\n'
        | content assignment
        | <empty>

assignment : NAME '=' value
           | NAME '=' list

value : TEXT
      | NUMBER

list : ???

Note: I need the '\n' in the grammar to forbid this kind of code:

foo
=
"bar"

Thanks in advance,

Antoine.


Source: (StackOverflow)

PLY: Token shifting problem in C parser

I'm writing a C parser using PLY, and recently ran into a problem. This code:

typedef int my_type;
my_type x;

is correct C code, because my_type is defined as a type before being used as one. I handle this by filling a type symbol table in the parser, which the lexer uses to differentiate between types and simple identifiers.

However, while the type declaration rule ends with SEMI (the ';' token), PLY shifts the my_type token from the second line before deciding it's done with the first one. Because of this, I have no chance to pass the update of the type symbol table to the lexer, and it sees my_type as an identifier, not a type.

Any ideas for a fix?
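
For what it's worth, a sketch of one general technique (rule and table names are illustrative, and this is not necessarily the solution referenced below): move the symbol-table update into an inner rule. That rule is reduced while SEMI is still the lookahead token, i.e. before the parser ever asks the lexer for the my_type on the next line:

def p_typedef_decl(p):
    'typedef_decl : typedef_body SEMI'
    p[0] = p[1]

def p_typedef_body(p):
    'typedef_body : TYPEDEF type_spec ID'
    type_table.add(p[3])   # visible to the lexer before the next token
    p[0] = ('typedef', p[2], p[3])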

The full code is at: http://code.google.com/p/pycparser/source/browse/trunk/src/c_parser.py Not sure how I can create a smaller example out of this.

Edit:

Problem solved. See my solution below.


Source: (StackOverflow)

Using PLY to parse SQL statements

I know there are other tools out there to parse SQL statements, but I am rolling my own for educational purposes. I am getting stuck with my grammar right now. If you can spot the error quickly, please let me know.

SELECT = r'SELECT'
FROM = r'FROM'
COLUMN = TABLE = r'[a-zA-Z]+'
COMMA = r','
STAR = r'\*'
END = r';'
t_ignore = ' ' #ignores spaces

statement : SELECT columns FROM TABLE END

columns : STAR
        | rec_columns

rec_columns : COLUMN
            | rec_columns COMMA COLUMN

When I try to parse a statement like 'SELECT a FROM b;' I get a syntax error at the FROM token... Any help is greatly appreciated!

(Edit) Code:

#!/usr/bin/python
import ply.lex as lex
import ply.yacc as yacc

tokens = (
    'SELECT',
    'FROM',
    'WHERE',
    'TABLE',
    'COLUMN',
    'STAR',
    'COMMA',
    'END',
)

t_SELECT    = r'select|SELECT'
t_FROM      = r'from|FROM'
t_WHERE     = r'where|WHERE'
t_TABLE     = r'[a-zA-Z]+'
t_COLUMN    = r'[a-zA-Z]+'
t_STAR      = r'\*'
t_COMMA     = r','
t_END       = r';'

t_ignore    = ' \t'

def t_error(t):
    print 'Illegal character "%s"' % t.value[0]
    t.lexer.skip(1)

lex.lex()

NONE, SELECT, INSERT, DELETE, UPDATE = range(5)
states = ['NONE', 'SELECT', 'INSERT', 'DELETE', 'UPDATE']
current_state = NONE

def p_statement_expr(t):
    'statement : expression'
    print states[current_state], t[1]

def p_expr_select(t):
    'expression : SELECT columns FROM TABLE END'
    global current_state
    current_state = SELECT
    print t[3]


def p_recursive_columns(t):
    '''recursive_columns : recursive_columns COMMA COLUMN'''
    t[0] = ', '.join([t[1], t[3]])

def p_recursive_columns_base(t):
    '''recursive_columns : COLUMN'''
    t[0] = t[1]

def p_columns(t):
    '''columns : STAR
               | recursive_columns''' 
    t[0] = t[1]

def p_error(t):
    print 'Syntax error at "%s"' % t.value if t else 'NULL'
    global current_state
    current_state = NONE

yacc.yacc()


while True:
    try:
        input = raw_input('sql> ')
    except EOFError:
        break
    yacc.parse(input)
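
If I read PLY correctly, string-defined tokens are tried in order of decreasing regex length, so t_TABLE's '[a-zA-Z]+' (10 characters) is tried before t_FROM's 'from|FROM' (9 characters) and eats the FROM keyword as a TABLE token, which would explain the error. The usual PLY pattern is a single identifier token plus a reserved-word table; a sketch (names adapted to the code above):

reserved = {
    'select': 'SELECT',
    'from':   'FROM',
    'where':  'WHERE',
}

tokens = ['ID', 'STAR', 'COMMA', 'END'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z]+'
    # Reclassify reserved words so this rule cannot shadow them.
    t.type = reserved.get(t.value.lower(), 'ID')
    return t

# The grammar then uses ID for both tables and columns:
#   statement : SELECT columns FROM ID END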

Source: (StackOverflow)

Python PLY zero or more occurrences of a parsing item

I am using Python with PLY to parse LISP-like S-expressions, and when parsing a function call there can be zero or more arguments. How can I express this in the yacc code? This is my function so far:

def p_EXPR(p):
    '''EXPR : NUMBER
            | STRING
            | LPAREN funcname [EXPR] RPAREN'''
    if len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = ("Call", p[2], p[3:-1])

I need to replace "[EXPR]" with something that allows zero or more EXPR's. How can I do this?
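
A sketch of the usual yacc idiom: a separate left-recursive rule that derives zero or more EXPRs ('exprs' is an illustrative name):

def p_exprs_empty(p):
    'exprs :'
    p[0] = []

def p_exprs_append(p):
    'exprs : exprs EXPR'
    p[0] = p[1] + [p[2]]

def p_EXPR_call(p):
    'EXPR : LPAREN funcname exprs RPAREN'
    p[0] = ("Call", p[2], p[3])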


Source: (StackOverflow)

Should I use Lex or a home-brewed solution to parse a formula?

I'm in the process of writing a small, rule-based 'math' engine. I realize this is unclear, so I'll provide a small example.

Let's say you have some variable a, that holds an integer. You also have some functions you can apply to the number, i.e.

  • sqr - square the number
  • flp - flip the bits of the number
  • dec - decrement the number
  • inc - increment the number

You can then say do_formula(a, "2sqr+inc+flp"). If a were 3, it would square it twice (81), increment it (82), and flip its bits (~82, which is -83 if dealing with signed integers, I believe).

What would be the best way to parse the formula? It's relatively simple, and I'm thinking of making all the opcodes three characters... would it be overkill to use Lex? Should I just write a simple home-brewed solution or use something else entirely?

I realize the above example is silly; I'm not building a calculator that'll do that, but it illustrates what I'm trying to do well enough.
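
For scale, here is what a home-brewed version might look like under the stated format (three-letter opcodes, an optional repeat count, '+' separators); at this size a full lexer does look like overkill:

import re

OPS = {
    'sqr': lambda n: n * n,
    'flp': lambda n: ~n,
    'dec': lambda n: n - 1,
    'inc': lambda n: n + 1,
}

def do_formula(value, formula):
    # Each step is an optional repeat count plus a three-letter opcode.
    for part in formula.split('+'):
        count, op = re.fullmatch(r'(\d*)([a-z]{3})', part).groups()
        for _ in range(int(count or 1)):
            value = OPS[op](value)
    return value

print(do_formula(3, '2sqr+inc+flp'))  # -83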


Source: (StackOverflow)

Does Pyparsing Support Context-Sensitive Grammars?

Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.

I am working on a parser for ODL (Object Description Language), an arcane language that, as far as I can tell, is now used only by NASA PDS (Planetary Data System; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.

ODL defines objects in something like the following manner:

OBJECT              = TABLE
  ROWS              = 128
  ROW_BYTES         = 512 
END_OBJECT          = TABLE

I am attempting to write a parser with pyparsing, and I was doing fine right up until I came to the above construction.

I have to create some rule that ensures the right-hand value (RHV) of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.

  1. Am I correct in my intuition that this is a context-sensitive grammar? Is that the phrase I should be using to describe this problem?
  2. Whatever kind of grammar this is in the theoretical sense, is pyparsing able to handle this kind of construction? (See the sketch after this list.)
  3. If pyparsing is not able to handle it, is there another Python tool capable of doing so? How about ply (the Python implementation of lex/yacc)?
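
On point 2: pyparsing ships helpers for exactly this back-reference pattern, matchPreviousLiteral and matchPreviousExpr. A sketch against the example above (SkipTo stands in for a real body grammar):

from pyparsing import Keyword, Word, alphas, alphanums, SkipTo, matchPreviousLiteral

# obj_name is used only on the OBJECT line, so its "previous match"
# is still the object's name when END_OBJECT is reached.
obj_name = Word(alphas + '_', alphanums + '_')
block = (Keyword('OBJECT') + '=' + obj_name
         + SkipTo(Keyword('END_OBJECT'))
         + Keyword('END_OBJECT') + '=' + matchPreviousLiteral(obj_name))

text = '''OBJECT              = TABLE
  ROWS              = 128
  ROW_BYTES         = 512
END_OBJECT          = TABLE'''

print(block.parseString(text))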

Source: (StackOverflow)

Match unicode in ply's regexes

I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters, so the old way of doing things is not enough:

t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*"

In my markup language parser I match Unicode characters by allowing all characters except the ones I explicitly use, because my markup language only has two or three characters that need to be treated that way.

How do I match all Unicode characters with Python regexes and PLY? And is this a good idea at all?

I'd like to let people use identifiers like Ω » « ° foo² väli π as identifiers (variable names and such) in their programs. Heck! I want people to be able to write programs in their own language, if that's practical! Anyway, Unicode is supported in a wide variety of places nowadays, and it should spread.

Edit: POSIX character classes don't seem to be recognised by Python regexes.

>>> import re
>>> item = re.compile(r'[[:word:]]')
>>> print item.match('e')
None

Edit: To explain better what I need: I'd need a regex that matches all printable Unicode characters but no ASCII characters at all.

Edit: r"\w" does some of what I want, but it does not match « », and I also need a regex that does not match numbers.
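
A sketch assuming Python 3, where str patterns are Unicode-aware by default: [^\W\d] means "a word character that is not a digit", which keeps identifiers from starting with a number, and an explicit extra class admits symbols like « » ° that are not word characters (the exact set is illustrative):

import ply.lex as lex

tokens = ('IDENTIFIER',)

t_IDENTIFIER = r'(?:[^\W\d]|[«»°²])(?:\w|[«»°²])*'

t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('Ω väli π foo² x»y')
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)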


Source: (StackOverflow)