C to Python Translator using LEX & YACC
B Sai Abhishek¹, Balam Ruchith Balaji², Chillakuru Hari³, Meena Belwal⁴*
¹ ² ³ ⁴* Department of Computer Science and Engineering, Amrita School of Computing, Bengaluru,
Amrita Vishwa Vidyapeetham, Karnataka, India.
¹abhishek.busetty@gmail.com, ²bl.ruchith@gmail.com, ³hari291010reddy@gmail.com,
⁴*b_meena@blr.amrita.edu
Abstract
This project uses Lex and YACC to build a C-to-Python translator. The aim of the translation
is to convert C code into Python while keeping all functions and logic intact. To handle the
lexical and syntactic differences between the two languages, our approach defines
tokenization rules in Lex and parsing rules in YACC. The translator aims to reproduce the
logic of the original C code in Pythonic syntax and style. Through collaborative development
and thorough testing, we aim to provide a high-quality tool that eases the move from code
written in C to Python.
1. Introduction
The need for efficient language translation technologies has grown sharply in recent years,
driven mostly by demand from software developers, and it has become a key field of study.
The main topic of this work is the translation from the C programming language to Python.
Both researchers and programming practitioners have proposed a number of ways of doing the
conversion. Among the noteworthy studies, Freda Shi et al. [2] investigate several existing
techniques for translating C to Python and take a closer look at their strengths and
weaknesses. Jones and Brown present a linear approach for programmers to use Lex together
with YACC to automate the translation process, providing insight into the applicability and
practicality of such tools.
Handling the syntactic differences between C and Python is identified as one of the main
challenges in "Prospects and Challenges of Converting C Code to Python" (2019) by Lee and
Kim. That paper stresses that reconciling the two languages is hard because of their striking
disparities in syntax, and it discusses attempts to address them.
This research project investigates how C code can be converted to Python code in a reliable
and effective way. Our desired outcome is a robust translator that, by exploiting Lex and
YACC, generates Python code equivalent to the original C code while maintaining its logic
and performance. The study evaluates how well this technique bridges the syntactic gaps
between the two languages and enables smooth conversion from C to Python.
2. Literature Review
Mikel Artetxe et al. [1] identified inadequacies in existing unsupervised statistical machine
translation (SMT) methods and addressed them by using subword knowledge, a theoretically
sound unsupervised tuning method, and multi-parametric alignments to improve translations.
The research by Lachaux et al. [2] applies unsupervised machine translation to the
development of complete cross-language program translators. The model is trained on public
GitHub repositories and translates between C++, Java, and Python with high accuracy. It is
reported to be more accurate than existing commercial tools, while being more cost-effective
and requiring less time to perform the task.
Zuchao Li et al. [3] proposed a reference-language-based framework for unsupervised neural
machine translation (UNMT), called RUNMT, which extends the usual source-target paradigm
with a reference language. Their experiments show better translation quality than a UNMT
baseline that uses only one auxiliary language.
The work by Mikel Artetxe et al. [4] proposes an NMT system trained solely on monolingual
data in an unsupervised manner, which removes the requirement for parallel corpora. The
system performs at a level comparable to earlier approaches and can also make use of small
volumes of parallel data.
Lachaux et al. [5] introduced a new pre-training objective called DOBF, which exploits the
structure of programming languages by training a model to deobfuscate source code in an
unsupervised setting, and which outperforms existing pre-training methods.
Syed Abdul Basit Andrabi et al. [6] surveyed the machine translation mechanisms used for
resource-poor languages, clarifying the problems and needs and reviewing the systems
currently available for such languages.
Baptiste Rozière et al. [7] observed that unsupervised source code translation methods
produce noisy and error-prone results. To deal with this, they use automated unit testing to
filter out incorrect translations, thereby assembling a quality-checked parallel corpus.
Wasi Uddin Ahmad et al. [8] argued that program translation is a key part of software
development, but that unsupervised strategies are reaching their limits. AVATAR, a set of 9,515
Jesse Michael Han et al. [9] showed that, using generatively pre-trained language models, an
unsupervised framework achieves a METEOR score of 97.3 on the WMT14 English-French dataset,
which the authors report as state of the art.
The thesis by Akila Loganathan et al. [10] applies learning techniques, particularly K-Means
clustering (an unsupervised method), to preprocess and analyze source code before it is
translated into the target language with a rule-based approach. The study covers migration
from C++ to Java exclusively. It implements a set of practical translation rules, which are
evaluated with basic accuracy metrics of 77.89% and 81.34% and compared with similar
approaches. The research seeks to validate the effectiveness of tailored migration programs
and gauge their achievements.
Freda Shi et al. [11] constructed a Python-to-C translator with Lex and YACC, which involved
defining the output, performing transformations, semantic analysis, and syntax processing.
This is a delicate process that requires the developers' attention in order to reach the
intended accuracy and readability.
Likhith et al. [12] developed a Lex- and YACC-based tool that interprets mathematical
operations written as English phrases and converts them into executable code, making it
possible to perform computations using only English phrases.
Pecheti et al. [20] suggested that Abstract Syntax Trees (ASTs) are well suited to evaluating
mathematical expressions. The method can be used in professional and academic applications.
It handles equation solving and supports operations such as multiplication and division. The
hierarchical AST model makes evaluation more effective. Interactive graph visualization can
simplify and break down complex computation and engineering tasks, and is important for
mathematical expression recognition.
Mikel Artetxe et al. [21] developed a modular approach that combines a phrase table with an
n-gram language model and fine-tunes the parameters. Iterative backtranslation yields
further gains, improving on previous unsupervised systems by 7-10 BLEU points.
The research by Iftakhar Ahmad et al. [22] applies neural machine translation to
cross-architecture binary code analysis for computer security research. Their model,
UNSUPERBINTRANS, was evaluated and achieved high performance on the tasks of code similarity
detection and vulnerability discovery.
3. Methodology
Figure 1 shows the stages of the translation pipeline: lexical analysis, syntax analysis,
semantic analysis, and code generation, which together take the human-readable C source
through a machine-processable representation to the Python output.
3.1 Lexical Analysis (Lex/Flex): A tool such as Lex or Flex tokenizes the input C code.
The fundamental elements of the C language, called "tokens", are classified into categories
such as operators, keywords, and identifiers. Regular expressions are used to distinguish
each token type precisely.
3.2 Syntax Analysis (YACC/Bison): Bison or YACC is used to define the grammar rules for the
C language. Using the tokens produced by Lex/Flex and the given grammar rules, the parser
builds the parse tree. The grammar covers constructs such as conditionals, loops, statements,
and expressions.
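As an illustration of the tree-building step, the sketch below parses a flat list of token
strings into a nested-tuple parse tree. It is a hand-written recursive-descent parser in
Python, not the table-driven parser that YACC or Bison would generate; the function name
parse_expression, the tuple layout, and the restriction to arithmetic expressions are
assumptions made for this example.

```python
def parse_expression(tokens):
    """Parse tokens such as ['a', '+', '2', '*', 'b'] into a nested-tuple tree.

    Assumed grammar (illustrative only):
        expr   -> term   (('+' | '-') term)*
        term   -> factor (('*' | '/') factor)*
        factor -> operand | '(' expr ')'
    """
    pos = 0
    operators = ("+", "-", "*", "/", ")")

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok in operators:
            raise SyntaxError(f"unexpected token {tok!r}")
        return advance()  # identifier or literal

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

Here '*' and '/' bind tighter than '+' and '-' because term sits below expr in the grammar,
which is the same precedence that a YACC grammar would encode through its rule layering or
precedence declarations.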
3.3 Semantic Analysis: The parsed C code must be checked for semantic correctness. The checks
cover function definitions, variable declarations, and type compatibility; the errors that
may arise, such as undefined variables and incompatible types, must be reported and
corrected.
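A minimal sketch of such checks, written in Python for brevity, is given below. The statement
tuples, the function name check_semantics, and the deliberately strict compatibility rule
(which is narrower than C's own implicit conversions) are assumptions made for illustration,
not the checks implemented in the translator.

```python
def check_semantics(statements):
    """Return a list of error messages for a flat list of statements.

    Each statement is a hypothetical tuple:
        ('declare', c_type, name)      e.g. ('declare', 'int', 'n')
        ('assign', name, value_type)   e.g. ('assign', 'n', 'double')
    """
    # Assumed rule: same-type assignment is allowed, and an int may widen to a double.
    allowed = {("int", "int"), ("double", "double"), ("double", "int"), ("char", "char")}
    symbols = {}  # symbol table: variable name -> declared C type
    errors = []
    for stmt in statements:
        if stmt[0] == "declare":
            _, c_type, name = stmt
            if name in symbols:
                errors.append(f"redeclaration of '{name}'")
            else:
                symbols[name] = c_type
        elif stmt[0] == "assign":
            _, name, value_type = stmt
            if name not in symbols:
                errors.append(f"undeclared variable '{name}'")
            elif (symbols[name], value_type) not in allowed:
                errors.append(
                    f"cannot assign {value_type} to {symbols[name]} '{name}'"
                )
        else:
            errors.append(f"unknown statement kind '{stmt[0]}'")
    return errors
```

Collecting the messages in a list, rather than stopping at the first problem, lets all
semantic errors in the input be reported in a single pass.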
3.4 Code Transformation: Walk the parse tree and convert every C construct into its Python
equivalent, using a stored mapping from C constructs to Python expressions. The
transformation must account for Python's significant indentation and its dynamic typing,
both of which differ from C.
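The tree walk can be sketched as follows. The node shapes ('decl', 'assign', 'while', 'if',
'return') and the function names emit and emit_block are hypothetical and chosen only for
this example; conditions and expressions are assumed to have been rendered as Python text
already.

```python
INDENT = "    "  # four spaces per nesting level


def emit(node, depth=0):
    """Return the Python source lines for one tree node at the given nesting depth."""
    pad = INDENT * depth
    kind = node[0]
    if kind == "decl":
        # The C type is dropped because Python is dynamically typed.
        _, _c_type, name, value = node
        return [f"{pad}{name} = {value}"]
    if kind == "assign":
        _, name, value = node
        return [f"{pad}{name} = {value}"]
    if kind == "while":
        _, cond, body = node
        return [f"{pad}while {cond}:"] + emit_block(body, depth + 1)
    if kind == "if":
        _, cond, then_body, else_body = node
        lines = [f"{pad}if {cond}:"] + emit_block(then_body, depth + 1)
        if else_body:
            lines += [f"{pad}else:"] + emit_block(else_body, depth + 1)
        return lines
    if kind == "return":
        return [f"{pad}return {node[1]}"]
    raise ValueError(f"unsupported node kind: {kind}")


def emit_block(nodes, depth):
    """Emit a list of nodes; an empty C block becomes 'pass' in Python."""
    lines = []
    for child in nodes:
        lines += emit(child, depth)
    return lines or [INDENT * depth + "pass"]
```

C braces carry no meaning in Python, so the nesting depth of each node is the only
information needed to reproduce the block structure through indentation.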
3.5 Output Generation: The translated constructs are concatenated to form the Python source
code. The end product of the translation process is a Python script or module.
3.6 Supported Syntax: The developed code transformer supports the following syntax.
• Variables: When translating variables between C and Python, keep in mind the equivalent
data types available in each language. For example, int in C maps to int in Python, float
in C maps approximately to float in Python, and so on.
• Literals: Literal values (constants) such as integers, floating-point numbers, characters,
and strings are converted directly from C syntax to the corresponding Python syntax.
• One-Dimensional Arrays: C-style array declarations and access expressions are replaced by
their Python equivalents. In Python, lists are used to represent arrays. Array indices carry
over unchanged, because Python, like C, uses zero-based indexing.
Figure 1. Flow chart of code transformation.
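The mappings for variables, literals, and one-dimensional arrays listed above can be sketched
for single declaration lines as follows. This regular-expression version is a simplification
for illustration; the helper name translate_declaration and the choice of default values for
uninitialized variables are assumptions, since C itself leaves uninitialized locals
indeterminate.

```python
import re

# Assumed placeholder values for declarations without an initializer.
DEFAULTS = {"int": "0", "float": "0.0", "double": "0.0", "char": "''"}

DECLARATION = re.compile(
    r"^\s*(int|float|double|char)\s+([a-zA-Z_][a-zA-Z0-9_]*)"  # type and name
    r"(?:\[(\d+)\])?"                                          # optional array size
    r"(?:\s*=\s*(.+?))?\s*;\s*$"                               # optional initializer
)


def translate_declaration(line):
    """Translate one simple C declaration into a Python assignment."""
    match = DECLARATION.match(line)
    if match is None:
        raise ValueError(f"unsupported declaration: {line!r}")
    c_type, name, size, init = match.groups()
    if size is not None:
        if init is None:
            # int a[5];  ->  a = [0] * 5
            return f"{name} = [{DEFAULTS[c_type]}] * {size}"
        init = init.strip()
        if not (init.startswith("{") and init.endswith("}")):
            raise ValueError(f"unsupported array initializer: {init!r}")
        # int a[3] = {1, 2, 3};  ->  a = [1, 2, 3]
        return f"{name} = [{init[1:-1].strip()}]"
    if init is None:
        return f"{name} = {DEFAULTS[c_type]}"
    return f"{name} = {init.strip()}"
```

For instance, translate_declaration("int a[5];") yields "a = [0] * 5", and an access such as
a[2] needs no index adjustment in the generated Python.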
Table 1 lists the tokens associated with the keywords and patterns recognized in the C code.
The lexer returns these tokens and passes them to the next stage, which generates the Python
code.
C keyword or pattern        Token returned
if                          TIF
else                        TELSE
for                         TFOR
while                       TWHILE
return                      TRETURN
int                         TINTTYPE
double                      TDOUBLETYPE
char                        TCHARTYPE
void                        TVOIDTYPE
[a-zA-Z_][a-zA-Z0-9_]*      TIDENTIFIER
[0-9]+\.[0-9]*              TDOUBLE
[0-9]+                      TINTEGER
['].[']                     TCHAR
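The rules in Table 1 can be mimicked with Python's re module, as in the sketch below. The
token names come from Table 1; the rule ordering, the keyword lookup after an identifier
match, and the TOTHER catch-all for operators and punctuation are assumptions made for this
example and are not taken from the project's Lex file.

```python
import re

# Keywords from Table 1, resolved after an identifier has been matched.
KEYWORDS = {
    "if": "TIF", "else": "TELSE", "for": "TFOR", "while": "TWHILE",
    "return": "TRETURN", "int": "TINTTYPE", "double": "TDOUBLETYPE",
    "char": "TCHARTYPE", "void": "TVOIDTYPE",
}

# Order matters: TDOUBLE must be tried before TINTEGER so that "3.5" is one token.
TOKEN_SPEC = [
    ("TDOUBLE", r"[0-9]+\.[0-9]*"),
    ("TINTEGER", r"[0-9]+"),
    ("TCHAR", r"'.'"),
    ("TIDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("SKIP", r"\s+"),   # whitespace is discarded
    ("TOTHER", r"."),   # assumed catch-all for operators and punctuation
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token_name, lexeme) pairs for a C source string."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "TIDENTIFIER":
            # A whole-word keyword overrides the generic identifier token.
            kind = KEYWORDS.get(text, kind)
        tokens.append((kind, text))
    return tokens
```

Looking keywords up only after a full identifier has been matched keeps a name such as
"iffy" from being split into the keyword "if" followed by "fy".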
4. Results
We tested our framework on various C programs. In this section we present three examples
covering different syntax; each C program was written, translated, and checked to produce
working Python output. The corresponding inputs and outputs are shown in the figures.
Example 1
4.1 Input.c
The C code generates the Fibonacci sequence up to the number of terms chosen by the user. It
initializes its variables, then uses a 'for' loop and an 'if' statement, updating the
variables as needed.
#include <stdio.h>
int n;
int first;
int second;
4.2 Out.py
We tested our system on various C constructs, such as for loops, while loops, if-else
statements, and arrays, and checked whether the translator generated Python files as output.
These Python files ran without errors, showing that the translator converts the C code into
Python code that runs successfully.
5. Conclusion
The observations show that the C-to-Python translator successfully translates C code into
Python code that is syntactically correct and functionally equivalent. The generated Python
code ran without errors, which indicates that the translation preserves the logic and
functionality of the original C code. The translator could therefore be a useful resource
for developers who want to migrate C code to, or work with it in, Python environments.
In conclusion, building this C-to-Python translator with Lex and YACC involved defining the
output, performing transformations, conducting semantic analysis, and implementing the
lexical and syntax processing. With the technologies and methods described above, it should
be possible to build a translator that works in both directions, from C to Python and the
reverse. Attention to detail is crucial throughout this process, particularly when managing
the syntactic and semantic differences between the languages so that the translated result
is accurate and readable. In addition, further validation and practical documentation are
needed before the translator can be declared fully functional and ready for use.
References
1. Artetxe, Mikel, Gorka Labaka, and Eneko Agirre. "An effective approach to unsupervised
machine translation." arXiv preprint arXiv:1902.01313 (2019).
2. Lachaux, Marie-Anne, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample.
"Unsupervised translation of programming languages." arXiv preprint arXiv:2006.03511
(2020).
3. Li, Zuchao, Hai Zhao, Rui Wang, Masao Utiyama, and Eiichiro Sumita. "Reference
language based unsupervised neural machine translation." arXiv preprint
arXiv:2004.02127 (2020).
4. Jayadeep, Gautham, N. V. Vishnupriya, Vyshnavi Venugopal, S. Vishnu, and M. Geetha.
"Mudra: convolutional neural network based Indian sign language translator for banks." In
2020 4th International Conference on Intelligent Computing and Control Systems
(ICICCS), pp. 1228-1232. IEEE, 2020.
5. Lachaux, Marie-Anne, Baptiste Roziere, Marc Szafraniec, and Guillaume Lample.
"DOBF: A deobfuscation pre-training objective for programming languages." Advances in
Neural Information Processing Systems 34 (2021): 14967-14979.