Today I was trying to generate PDF reports using Geraldo Reports, and I needed those reports to contain Arabic text. Arabic is written in a rather special script with two essential features:
- It is written from right to left.
- The characters change shape according to their surrounding characters.
So when you try to print Arabic text in an application – or a library – that doesn’t support Arabic you’re pretty likely to end up with something that looks like this:
We have two problems here. First, the characters are in their isolated forms, which means that every character is rendered without regard to its surroundings. Second, the text is written from left to right.
To solve the latter issue all we have to do is to use the Unicode bidirectional algorithm, which is implemented purely in Python in python-bidi. If you use it you’ll end up with something that looks like this:
The only issue left to solve is to reshape those characters and replace them with their correct shapes according to their surroundings.
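To make the reshaping idea concrete, here is a small sketch that uses only Python's standard unicodedata module (it is not the reshaper's own code). It looks up the four contextual presentation forms that Unicode defines for the letter Dad (U+0636). The choice of letter and the variable names are mine, for illustration only.

```python
import unicodedata

# Arabic Presentation Forms-B code points for the letter Dad (U+0636).
# The letter is picked only as an illustration.
DAD = "\u0636"
DAD_FORMS = ["\uFEBD", "\uFEBE", "\uFEBF", "\uFEC0"]

for form in DAD_FORMS:
    # Each presentation form carries its position in its Unicode name.
    print("U+%04X" % ord(form), unicodedata.name(form))

# Compatibility normalization folds every contextual form back to the base letter.
folded = [unicodedata.normalize("NFKC", form) for form in DAD_FORMS]
print(all(ch == DAD for ch in folded))
```

A reshaper goes in the opposite direction: given a base letter and its neighbours, it picks the presentation form that fits that position.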
I solved this issue more than four years ago in a small application that I wrote in Visual Basic; my solution was naive, but it worked well. A few days ago I faced the same problem, rendering Arabic text correctly, but this time on Android. I searched and used the solution in this SO answer, which is pretty similar to the solution provided in Better Arabic Reshaper.
Today I ported the solution in Better Arabic Reshaper from Java to Python, tweaked it a little bit, and used it to successfully render Arabic text in PDF, and the result was:
Pretty cool, right? Here is another test with English text and some diacritics in it:
It looks fine! In Word the same text looks like this:
Amazing! Now it is time for you to use the ported library along with python-bidi to solve those issues.
Usage
import arabic_reshaper
from bidi.algorithm import get_display
#...
reshaped_text = arabic_reshaper.reshape(u'اللغة العربية رائعة')
bidi_text = get_display(reshaped_text)
pass_arabic_text_to_render(bidi_text)  # <-- This function does not really exist
#...
The pass_arabic_text_to_render function here is imaginary. It is only there to show that bidi_text is the variable you would use in your code afterwards, for example to print it in a PDF or to draw it on an image.
Demo
You can try an online demo of this script on my Python/Django site here: Arabic Reshaper Online.
Download
The source code is licensed under the GNU General Public License (GPL).
Project on GitHub
Source code download from GitHub
Have fun واستمتع! 🙂
God bless you, brother Abdullah.
I hope this will help me with the many things that do not support Arabic.
I hope we can get in touch by email.
And God bless you too, brother Khaled.
You can contact me through the email address on the About Me page.
Hello Abd,
Thank you for this *extremely* valuable port. Quick question, regarding “single letters”.
Your algorithm reshapes an isolated letter, such as ض (\u0636) into a shaped one : ﺿ (\uFEBF).
I don’t think this is correct. I am considering adding a line at the very top of the function “get_reshaped_word” to exclude one-letter words. Would that make sense?
def get_reshaped_word(unshaped_word):
    if len(unshaped_word) == 1: return unshaped_word  ### <-- New
    unshaped_word = replace_lam_alef(unshaped_word)
    decomposed_word = DecomposedWord(unshaped_word)
    …
Hi Louis,
Thanks for your reply and bug report. I have fixed it now and the fix is on GitHub, so you can download the library again.
God bless you.
It works perfectly for me.
Thanks.
Thanks for this project. I just wanted to let you know that my problem with the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
was solved by putting the following lines in arabic_reshaper.py:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
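For readers on Python 3, where reload(sys) and sys.setdefaultencoding no longer exist, the same error can be avoided by encoding explicitly at the point where bytes are needed. The sketch below only illustrates that idea and is not a change to arabic_reshaper.py. The sample string is the one from the usage example.

```python
text = u"اللغة العربية رائعة"

# Encoding to ASCII is what raises the UnicodeEncodeError quoted above.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 works and round-trips without loss.
utf8_bytes = text.encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == text
print(ascii_failed, round_trip_ok)
```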
First, thanks for sharing this code, but I have a problem with the example that you provided.
pass_arabic_text_to_render(bidi_text)
NameError: name 'pass_arabic_text_to_render' is not defined
Welcome,
You get this error because pass_arabic_text_to_render is not a real function. It is only a placeholder: in its place you should call your own rendering function, one that accepts the Arabic text and renders it, whether that is PDF printing, a PIL image, or anything else.
Cheers.
Assalam Alykum
Thank you, brother, for your great effort and for sharing it. Now I can finally use beautiful Arabic fonts on Linux for OpenERP Arabic reports.
arabic_reshaper.py was suggested as part of a solution for OpenERP Arabic reports in https://github.com/barsi/openerp-rtl
I have noticed a vertical alignment problem when generating the reports: the data is not well aligned vertically. Is this issue related to the reshaper, or to ReportLab’s rendering of the Arabic font?
Note that before I used the solution in the link [ https://github.com/barsi/openerp-rtl ], some fonts were well aligned but had the square-box issue. Now they display correctly but are not well aligned vertically!
Wa Alaikom Al Salaam,
Thanks, Razan, for using this solution. I think the problem you’re having is due to the font you’re using, because I’ve used multiple fonts with Arabic text in Python and never had this vertical alignment problem. You should experiment with several fonts until you find the one that works best for you. I tried Arial and Helvetica; try them if you want.
Good luck…
Thanks. I’ve tried various fonts, even Arial, but I still have the same problem. I have now found that the alignment for reports in the ReportLab engine is handled in the paragraph.py file, and that is where the problem comes from, so I am trying some tricks there.
thanks again
Sorry I wasn’t of help to you.
Thanks in advance 🙂
Hello,
I’m using your module together with bidi, and the Arabic text itself is clearly correct and well wrapped, whether in the console or in a text editor. However, I need to render Arabic text properly as a Paragraph in ReportLab, and there I have a problem with word wrap: the RTL text is wrapped, but the new line appears above the first one, not under it. How did you get past this?
best regards and thanks for your effort
Marek
Hi Marek,
Can you show me an example of what is going on? The only problem I know of with paragraphs is that the text gets messed up when it needs to be wrapped, so you need to break it into lines before you reshape it.
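The advice above, breaking the text into lines before reshaping it, can be sketched roughly as follows. This is an illustration only: wrap_then_shape and its shape parameter are hypothetical names, and the width is counted in characters rather than measured with real font metrics. In actual use, shape would be something like lambda s: get_display(arabic_reshaper.reshape(s)).

```python
import textwrap

def wrap_then_shape(text, width, shape=lambda line: line):
    """Split logical-order text into lines first, then shape each line.

    `shape` is a placeholder for the reshape + bidi step; the default
    leaves each line unchanged so the sketch runs without extra libraries.
    """
    # Wrapping happens on the original (logical-order) text, so words are
    # assigned to lines in reading order.
    lines = textwrap.wrap(text, width=width)
    # Each line is shaped on its own, so reordering never crosses a line break.
    return [shape(line) for line in lines]

sample = u"إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية"
for line in wrap_then_shape(sample, width=30):
    print(line)
```

Each returned line would then be drawn as its own right-aligned line, from top to bottom.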
Hi Abd Allah,
thanks for the response. Let’s say I have this Arabic snippet:
إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة
In English this should mean something like: “If we take into account the nature of climate variability and inter-annual variability and those on long-term addition to the lack of accuracy of measurements and calculations used….”
Now I want to render it as Reportlab PDF doc:
arabic_text = u'إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة'
arabic_text = arabic_reshaper.reshape(arabic_text) # join characters
arabic_text = get_display(arabic_text) # change orientation by using bidi
pdf_file = open('disclaimer.pdf', 'wb') # PDF output is binary
pdf_doc = SimpleDocTemplate(pdf_file, pagesize=A4)
pdfmetrics.registerFont(TTFont('Arabic-normal', '../fonts/KacstOne.ttf'))
style = ParagraphStyle(name='Normal', fontName='Arabic-normal', fontSize=12, leading=12. * 1.2)
style.alignment = TA_RIGHT
pdf_doc.build([Paragraph(arabic_text, style)])
pdf_file.close()
The result is here https://www.dropbox.com/s/gdyt6930jlad8id/disclaimer.pdf. You can see the text itself is correct and readable (at least for Google Translate :-)), but not wrapped as expected for RTL script.
best regards
Marek
Hi Abd Allah, just to clarify, what is going on: https://www.dropbox.com/s/0tn1977p7s9nlpi/arabic_bad_line_break_example.png
Peace be upon you.
A problem appears when printing a long sentence across more than one line.
https://www.dropbox.com/s/foadw5ykw4n7m8m/Screenshot%20from%202013-11-27%2019%3A39%3A05.png
And peace be upon you too. This is only a display problem within the page at that link; if you copy the text and paste it into an application, you will find that the problem is not there.
Thank you.
Thank you so much, really a wonderful job, thanks thanks thanks
You’re welcome 🙂 Thank you for the comment.
Thank you for this extremely valuable port, which helped generate printed registration rolls for over a million voters in Libya.
There is a minor bug with the lam-alef glyphs, which appears to be from the original Java package, as I have noted in GitHub issue #2.
We have also mirrored the RTL branch of reportlab to GitHub, in case others would like to use it without installing mercurial.
https://github.com/hnec-vr/reportlab-rtl
Thank you, Josh, for your reply and your bug report. I have fixed it on GitHub.
Would you be able to send me the case study for your project that you used this script in? I would love to see how people are using it 🙂 My email is mpcabd {( AT )} G Mail [DOT] COM
All the best 🙂
Hi Josh & Abd Allah,
I am still confused about how to break a block of Arabic text into lines in a ReportLab paragraph. Starting from the right side of the page, the text should run to the left margin and continue on a new line below, again starting from the right. This is not what happens when I run the code against the reportlab-rtl branch. In the PDF I got this:
و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة
إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ
instead of this:
إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة
This is the complete code (using reportlab-rtl, python-bidi and Abd Allah’s reshaper):
#encoding:UTF-8
from reportlab.lib.pagesizes import A4
from reportlab.platypus.doctemplate import SimpleDocTemplate
import arabic_reshaper # Abd Allah’s code
from bidi.algorithm import get_display # python_bidi
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.enums import TA_RIGHT
from reportlab.platypus.para import Paragraph
pdf_file = open('disclaimer_arabic.pdf', 'wb') # PDF output is binary
pdf_doc = SimpleDocTemplate(pdf_file, pagesize=A4)
arabic_text = u'إذا أخذنا بعين الإعتبار طبيعة تقلب المناخ و المتغيرات البينية السنوية و تلك على المدى الطويل إضافة إلى عدم دقة القياسات والحسابات المتبعة'
arabic_text = arabic_reshaper.reshape(arabic_text) # join characters
arabic_text = get_display(arabic_text) # change orientation by using bidi
#english_text = 'If we take into account the nature of climate variability and inter-annual variability and those on long-term addition to the lack of accuracy of measurements and calculations used'
pdfmetrics.registerFont(TTFont('Arabic-normal', 'KacstOne.ttf'))
style = ParagraphStyle(name='Normal', fontName='Arabic-normal', fontSize=12, leading=12. * 1.2)
style.alignment = TA_RIGHT
pdf_doc.build([Paragraph(arabic_text, style)])
pdf_file.close()
best
Marek
Marek, in your ParagraphStyle make sure you set wordWrap='RTL'. Otherwise, reportlab-rtl will act as if it’s LTR text.
I have tried it already (it looks very promising :-)), but unfortunately it has no effect, at least with my code…
Hi Bro.
How can I install it?
thank you
You can install it using pip like this:
[sourcecode language=””]
$ pip install python-bidi
$ pip install https://github.com/mpcabd/python-arabic-reshaper/archive/master.zip
[/sourcecode]
Hi Josh & Abd Allah,
I was trying the reportlab-rtl branch with the reshaper and bidi. ReportLab’s paragraph doesn’t seem to be RTL-enabled, because the block of Arabic text is not broken into lines properly. Text running from the right side is expected to continue on a new line below, again starting from the right. Instead, the new line appears above. Is this feature missing from the ReportLab Paragraph class for RTL text? It works for LTR.
all the best
Salam,
I have rewritten your library in the Haxe language so that I can port it to PHP, JavaScript, C#, C++ and Java, but I didn’t rewrite the get_display method.
The question is, why should I use get_display to reverse the text? I can reverse it by iterating through the letters in a simple loop, right?
Also, I have tried it, and in the result the ALEF looks like a LAM. Why is that? Please see here:
https://drive.google.com/file/d/0BwzBTCo1-KJBSHA4c25GRXZzNkE/edit?usp=sharing
Salam Samir,
You should use get_display to reverse the text. Yes, I assume you can simply iterate through the text in reverse, but get_display does more than just that. See here.
And as for your little problem with the Aleph, it’s certainly showing as the end-form Aleph (U+FE8E) which I think is happening because of a mistake copying/porting the code, check your code that corresponds to this line and this line.
Good luck.
Thanks brother! I’ve solved the problem; it was a brackets issue. But I’ve noticed that when I use the shadda, I get dotted circles.
I tried the text:
‘التّرجمة الفوريّة’
Which has shadda in each word, but here is the result:
https://drive.google.com/file/d/0BwzBTCo1-KJBbElZejRMY3JtRTg/edit?usp=sharing
Any advice? Maybe I have to use a specific font that has full Unicode support for Arabic? If that’s the problem, what fonts do you suggest?
Thank you!
Now this is unfortunately a known problem with diacritics, you won’t be able to use them properly when reversing the text 🙁
Here is the result after using a fully unicode font (Traditional Arabic Font)
https://drive.google.com/file/d/0BwzBTCo1-KJBUzNnNkQ3Z0lPVUE/edit?usp=sharing
there is an empty space under the shadda, what do you recommend?
This empty space is there because the Shadda is a non-spacing character. My recommendation is to strip the diacritics (حركات) from the text before you reverse it, because they won’t work properly after reversing the text.
But, how would I return them back to the stripped text?
That is why I said unfortunately: they won’t work, so you will have to strip them from the text and not put them back.
Dear brother, I think you could simply add more Unicode cases to cover all the Arabic letters with diacritics on them. I think there is a code point for each case, but I am not sure…
No unfortunately, diacritics are separate characters in Unicode, and when you render a character with a diacritic on it, you’re actually rendering two different characters combined. See here for a list of all the Arabic characters in Unicode and here for a list of all the Arabic characters presentation forms in Unicode.
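To illustrate the point that diacritics are separate code points, and the earlier suggestion to strip them before reordering, here is a small sketch using the standard unicodedata module. strip_marks is a hypothetical helper name. Removing every character of category Mn is a simplifying assumption, and it would also drop non-Arabic combining marks.

```python
import unicodedata

SHADDA = "\u0651"

def strip_marks(text):
    """Remove all non-spacing combining marks (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# The test phrase from the comments above, written with escapes so that
# the two shaddas (U+0651) are explicit.
sample = ("\u0627\u0644\u062a\u0651\u0631\u062c\u0645\u0629 "
          "\u0627\u0644\u0641\u0648\u0631\u064a\u0651\u0629")

print(unicodedata.name(SHADDA), unicodedata.category(SHADDA))
print(len(sample), len(strip_marks(sample)))
```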
Then there must be a way to “merge” the shadda glyph with the previous letter’s glyph. Notice that the shadda glyph is empty at the bottom; if you place the shadda over the previous letter, you will get the correct result. But how would we do that?!
Sure, that would be possible, but it’s a rendering issue, which is out of the scope of this script. The script doesn’t render the text; it just reshapes it so that you can pass the reshaped text to whatever renders it.
Man awesome, worked like magic! THUMBS UP
First of all, thank you for the good effort.
But I did not understand why you need the python-bidi library.
You can do without it by adding this code
RTL = ""
for letter in reshaped_text:
    RTL = letter + RTL
at the end.
And this is the complete program:
# -*- coding: utf-8 -*-
import arabic_reshaper
reshaped_text = arabic_reshaper.reshape(u'اللغة العربية رائعة')
RTL = ""
for letter in reshaped_text:
    RTL = letter + RTL
print(RTL)
Thank you.
Actually, you cannot do without the python-bidi library and replace it with your method, because your method will fail if the text contains numbers or non-Arabic characters.
Thank you for the comment 🙂
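Here is a quick way to see that failure using only the standard library. naive_reverse is just the loop from the earlier comment wrapped in a function (the name is mine), and the sample string is a made-up example that mixes Arabic with a number.

```python
import unicodedata

def naive_reverse(text):
    """Reverse the string character by character, as in the loop above."""
    result = ""
    for letter in text:
        result = letter + result
    return result

sample = u"عام 2013"  # "year 2013"
reversed_sample = naive_reverse(sample)

# The digits get reversed along with the letters, so the number is mangled.
print("2013" in reversed_sample, "3102" in reversed_sample)

# The bidirectional algorithm works on character classes, which is why
# digits (EN) and Latin letters (L) must not simply be flipped like
# Arabic letters (AL).
for ch in (u"ع", u"2", u"a"):
    print(repr(ch), unicodedata.bidirectional(ch))
```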
Salam o Alykum.
Newbies cannot understand how to use this script. I have the same issues with OpenERP reports, and also with the right/left direction of Arabic fonts.
It would be great if someone could give a bit more detail on how to use this script from scratch on Ubuntu 14.04.
thanks
Hi Zubair,
Please note that this is not a pip package; you have to download the script from GitHub and put it in your PYTHONPATH or next to your script. You will also need to install python-bidi, which can easily be done through pip install python-bidi. After that you will be able to call arabic_reshaper.reshape on your text and then pass it to bidi.algorithm.get_display to make it ready to be printed or passed to some other library that will handle the rendering of the text.
Regards.
Dear Abd,
Thanks for the reshaper. I do not know Arabic. I was testing the arabic reshaper. I would like to get some feedback if possible.
My understanding is that the proper way to write yeh followed by teh would be “یت” . My question is what should arabic reshaper produce once it applied to یﺖ
Hi Kursat,
The letter ى (Alef Maksura) is used only in its isolated and final forms, so no character should be joined after it. Check its Unicode forms here: http://www.fileformat.info/info/unicode/char/search.htm?q=ARABIC+LETTER+ALEF+MAKSURA&preview=entity
The only problem is that in Egypt this letter is used as a replacement for the letter ي (Yeh) in the final form. This is not proper Arabic; it belongs to Egyptian colloquial Arabic, which is, IMO unfortunately, used so much on the internet that it even has its own Wikipedia edition (http://arz.wikipedia.org/).
So to summarize, the reshaper should show these characters separated.
An example would be trying to reshape “ذهبت إلىمنزلي” (I went home) which has a mistake of not separating إلى from منزلي, and it will reshape to
, while the rendering engine on my Chrome shows it like
which is not proper Arabic.
I hope this answers your question.
Dear Abd,
Thank you for your answer. I am not very good with git or GitHub.
So, I would like to point out a line in ARABIC_GLYPHS where YEH and ALEF MAKSURA get mixed up in your code:
u'\u06CC' : [u'\u06CC', u'\uFEEF', u'\uFEF3', u'\uFEF4', u'\uFEF0', 4]
FEEF and FEF0 are the codes for ALEF MAKSURA.
Instead, this line must be:
u'\u06CC' : [u'\u06CC', u'\uFEF1', u'\uFEF3', u'\uFEF4', u'\uFEF2', 4]
Thanks,
k.
Hi Aker,
Sorry for the late response, I was a bit busy.
The Alef Maksura character is U+0649, and the only two forms it uses are U+FEEF (isolated) and U+FEF0 (final). The Yeh character is U+064A, and its four forms are U+FEF1 (isolated), U+FEF3 (initial), U+FEF4 (medial) and U+FEF2 (final). The character you’re referring to, U+06CC, is neither the Arabic Yeh nor the Arabic Alef Maksura. It is the Farsi Yeh, which has four forms: U+FEEF (isolated), same as Alef Maksura; U+FEF3 (initial), same as Yeh; U+FEF4 (medial), same as Yeh; and U+FEF0 (final), same as Alef Maksura.
I hope this clears up the confusion.
Regards.
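The code points discussed in this reply can be checked against the Unicode character database with Python's standard unicodedata module. This is only a lookup sketch (written for Python 3) and it does not touch the reshaper's ARABIC_GLYPHS table.

```python
import unicodedata

code_points = [
    0x0649, 0xFEEF, 0xFEF0,                  # Alef Maksura and its two forms
    0x064A, 0xFEF1, 0xFEF2, 0xFEF3, 0xFEF4,  # Yeh and its four forms
    0x06CC,                                  # Farsi Yeh
]

# Map each code point to its official Unicode character name.
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}
for cp in code_points:
    print("U+%04X  %s" % (cp, names[cp]))
```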
Dear Abdullah,
I’m a newbie at using ReportLab and I’m trying to use your reshaper code to create an Urdu document. My full code is as follows:
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
import arabic_reshaper
from bidi.algorithm import get_display
pdfmetrics.registerFont(TTFont('Urdu', 'JameelNooriNastaleeq.ttf')) # Urdu Nastaleeq font
c = canvas.Canvas(filename = 'test2.pdf', pagesize='A4')
x = 250
y = 500
text = u'عدنان الحسن'
reshaped_text = arabic_reshaper.reshape(text)
bidi_text = get_display(reshaped_text)
c.setFont('Urdu', 30)
c.drawString(x,y,bidi_text)
c.showPage()
c.save()
Unfortunately, I’m unable to get anything in the PDF file generated by the above code. The output from my code can be seen at the following link:
https://www.dropbox.com/s/pa252v1858xldn2/test2.pdf?dl=0
I would be grateful if you could suggest a solution to this problem.
Thanks in advance!
Adnan
Adnan Bhai Salam o Alykum.
http://youtu.be/ytJJlwHxAwo
Thanks Zubair,
Can you share the document, so that I can get the web-links easily.
regards,
Adnan
Okay, I managed to repeat the process you showed in the video, but still, I wasn’t able to get anything in pdf file. It’s an empty pdf file.
Salam o Alykum Mr. Adnan, this link was there in video description
http://www.4shared.com/office/RF3PMFlOba/OpenERP_-_ARABIC_Support_Teste.html
Source
here is the source
https://bitbucket.org/openerparabia/openerp-arabic-support
Hi Abdullah, thanks for your library and thanks for supporting “گچپژ” in your library. 🙂
Thanks a lot Abdallah 🙂 , I have searched many times for the issue and finally it works fine with your solution.
Works great! Thanks a lot
Thank you Abdullah. I used your library to create an Arabic word cloud using wordcloud Python library. If you create a post to explain the methodology of your code, that would be appreciated. Thanks.
Using matplotlib with python, it displays almost everything correctly, except Allah. In that case you see a rectangle