Tuesday 25 June 2013

Working of LEX

If you've ever written a compiler using compiler construction tools, you have probably heard about or even used LEX for generating a lexical analyser. This article tries to give an insight into the working of LEX. 


A brief introduction 

The LEX tool is used to find a certain pattern in the input stream and execute a corresponding action associated with it, as specified in the LEX program. The patterns are specified in the form of regular expressions and the actions as C code. Technically, LEX is a compiler which compiles a LEX program (prog_file.l) and generates a C file (lex.yy.c) which is in turn compiled using a C compiler to generate an executable file, which is the generated lexical analyser. 


But, why use LEX ?

The reason LEX is used instead of cascading a series of if-else statements with stcmp() conditions in an attempt to manually write a lexical analyzer in C , is because LEX offers :

  • Ability to handle character sequences as abstract entities
  • Ease of handling errors
  • Speed and efficiency



So what does this generated lex.yy.c file contain ?

Conceptually, LEX converts all the regular expressions into a finite state machine which it uses to accept or reject a string in the input stream. The corresponding action is executed when the machine is in accept state. The LEX compiler stores information about the constructed finite state machine in the form of a decision table (transition table) in the lex.yy.c file. Also the corresponding actions and the information regarding when they are to be executed is stored in the lex.yy.c file. A transition() function is used to access the decision table. LEX makes it's decision table visible if we compile the program with the -T flag.

Example : 
lex -T prog_file.l

The finite state machine used by LEX is deterministic in nature i.e. it is a DFA. The simulation of the constructed DFA is done in the lex.yy.c file. Hence, a LEX compiler constructs a DFA according to the specifications of the regular expression in the LEX program, and generates a simulation algorithm (to simulate the DFA) and a matching switch-case algorithm (to match and execute the appropriate action if the DFA enters an accept state).


How is the constructed DFA simulated ?

The working of the constructed DFA is simulated using the following algorithm.

DFA_simulator()
  current_state = start_state
  c = get_next_char()
  while(c != EOF)
    current_state = transition(current_state , c)
    c = get_next_char()
    if(current_state Final_states)
    /*ACCEPT*/
    else
    /*REJECT*/

The information about all the transitions made by the DFA can be obtained from the decision table (generally a two dimensional matrix) through the transition() function.

Friday 21 June 2013

Finding the longest matching pattern in a given input using FSM simulation

Finding a string in the input that matches a given pattern is a task accomplished by a Lexical Analyzer. LEX (a lexical analyzer generator) uses two disambiguation rules, one of which is to give more preference to the longer string in the input which matches a given pattern (in the form of a regular expression).This article simply explains how this is achieved in detail.            

LEX converts the regular expressions given to it into a Finite Automaton, which is deterministic in nature. To see how this is done , visit  http://silcnitc.github.io/documentation.html. It has not been explained here because it will take us beyond the scope of the article. 

The constructed DFA is then used to accept or reject a string in the input. Simply accepting a string when the automaton enters an accept state would not suffice, because this wouldn't correspond to the longest match, instead it would find the shortest match. The automaton must remember the last time it had entered the accepting state so that once the automaton enters a dead state, (A dead state is reached when the automaton has no further transitions from the current state on the provided input) it can simply backtrack to the last accepted state and accept the string which caused it to enter an accept state in the first place. 

In this manner, the automaton will always look for a longer matching string (call it string A) in the input (even if a string B has already been matched). If A exists for the same pattern, then the automaton accepts that string. If there is no such string A, the automaton will simply end up in a dead state, and since it had kept track of the last accepted state, it simply backtracks to that state and accepts string B.

In order to achieve this, the automaton simulator program must keep track of two variables : 

1. Last accepted state
2. Position of the input scanner at the last accepted state

The requirement for the latter of the two is so that the input scanner can resume scanning from this position if it encounters a dead state. i.e. It must resume scanning from after string B in the input. Hence now the scanner restarts scanning from this position, thus treating it as a new input string. This process continues till the end of the input is reached.

NOTE: String B is just a part of the input, not the entire input itself. Same applies for string A.

How is this achieved ?  (Algorithm) :


/*NOTE :
    ->Here states are numbered {0,1,2....}
    ->Transition table is a double dimensional array indexed with states(rows) and inputs(columns) and resultant state stored in the elements.  
    -> ' \in ' is a LATEX notation for 'belongs to'
    ->'0' is a dead state
    ->'1' is the start state
    ->Accepting_states[ ] is an array of accepting states
*/ 
Last_acc_pos=0         //Last position at which the automaton reached the accepting state

Last_acc_state=0       //Last state which was accepted by the automaton
Current_pos=0           //Current position of the scanner


WHILE ( !EOF )    c=getNextChar_from( Current_pos )
    Current_pos++
    //Setting the current state of the machine 
    IF ( Last_acc_state != 0 )
          Current_state = Transition[Last_acc_state,c]
    ELSE
          Current_state = Last_acc_state = 1
          Current_pos--    //retract by one position because read was insignificant here
    //Checking whether to accept and/or continue scanning
    IF ( Current_state \in Accepting_states )
          Last_acc_state=Current_state
          Last_acc_pos=Current_pos
    ELSE IF (Current_state = 0)
          /*Print acceptance of string */
          Last_acc_state=0
          Current_pos=Last_acc_pos