|
| 1 | +--- |
| 2 | +title: .NET 8 & Tesseract OCR on Amazon Linux 2023 EC2 | Syncfusion |
| 3 | +description: Install & configure .NET 8, Tesseract OCR on Amazon Linux 2023 EC2 to perform OCR on PDFs & images using Syncfusion .NET OCR library. |
| 4 | +control: PDF |
| 5 | +documentation: UG |
| 6 | +keywords: Assemblies |
| 7 | +--- |
| 8 | + |
| 9 | +# Perform OCR with Tesseract on Amazon Linux EC2 using .NET application |
| 10 | + |
| 11 | +The [Syncfusion<sup>®</sup> .NET OCR library](https://www.syncfusion.com/document-processing/pdf-framework/net/pdf-library/ocr-process) extracts text from scanned PDFs and images in Linux applications with the help of Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) Optical Character Recognition engine. |
| 12 | + |
| 13 | +This guide provides a detailed, step-by-step process for installing Tesseract OCR and its essential dependencies directly on an Amazon Linux 2023 (AL2023) EC2 instance. This approach allows you to deploy .NET applications that utilize OCR functionalities, such as those relying on Syncfusion PDF Processing with Tesseract, without the need for Docker containers. |
| 14 | + |
| 15 | +## Prerequisites |
| 16 | + |
| 17 | +Before you begin, ensure you have: |
| 18 | + |
| 19 | +* An active Amazon Linux 2023 (AL2023) EC2 instance. |
| 20 | +* SSH access to your EC2 instance. |
| 21 | +* Basic familiarity with Linux command-line operations. |
| 22 | + |
| 23 | + |
| 24 | +## Installation steps for .NET 8 and Tesseract OCR on Amazon Linux 2023 EC2 |
| 25 | + |
| 26 | +Execute the following commands sequentially in your EC2 instance's terminal. It is recommended to run these commands from the `/home/ec2-user` directory unless specified otherwise. |
| 27 | + |
| 28 | +Step 1: **Update System Packages**: Start by ensuring that all existing packages on your EC2 instance are up to date. |
| 29 | + |
| 30 | +{% highlight c# tabtitle="C#" %} |
| 31 | + |
| 32 | +sudo yum update -y |
| 33 | + |
| 34 | +{% endhighlight %} |
| 35 | + |
| 36 | +Step 2: **Add Microsoft Package Repository** : To install the .NET SDK, add the Microsoft package repository. This guide uses the repository published for Fedora 39, because AL2023 is derived in part from Fedora. |
| 37 | + |
| 38 | +{% highlight c# tabtitle="C#" %} |
| 39 | + |
| 40 | +sudo curl -o /etc/yum.repos.d/packages-microsoft-com-prod.repo https://packages.microsoft.com/config/fedora/39/prod.repo |
| 41 | + |
| 42 | +{% endhighlight %} |
| 43 | + |
| 44 | +Step 3: **Install .NET SDK**: Install the .NET 8.0 SDK using yum. This is essential for building and running your .NET application. |
| 45 | + |
| 46 | +{% highlight c# tabtitle="C#" %} |
| 47 | + |
| 48 | +sudo yum install -y dotnet-sdk-8.0 |
| 49 | + |
| 50 | +{% endhighlight %} |
| 51 | + |
| 52 | +Step 4: **Verify .NET SDK Installation** : Confirm that the .NET SDK has been installed correctly by checking its version. |
| 53 | + |
| 54 | +{% highlight c# tabtitle="C#" %} |
| 55 | + |
| 56 | +dotnet --version |
| 57 | + |
| 58 | +{% endhighlight %} |
| 59 | + |
| 60 | +You should see output similar to `8.0.x` (where `x` is the patch version). |
| 61 | + |
| 62 | +Step 5: **Install `libgdiplus` Package** : `libgdiplus` is a Mono implementation of the GDI+ API, often required by .NET applications for image processing. Run the following commands in order, as a single block, starting from the `/home/ec2-user` directory. |
| 63 | + |
| 64 | +{% highlight c# tabtitle="C#" %} |
| 65 | + |
| 66 | +sudo yum groupinstall "Development Tools" -y |
| 67 | +sudo yum install autoconf automake libtool gettext libjpeg-turbo-devel libpng-devel giflib-devel freetype-devel -y |
| 68 | + |
| 69 | +git clone https://github.com/mono/libgdiplus.git |
| 70 | +cd libgdiplus |
| 71 | +./autogen.sh |
| 72 | +make |
| 73 | +sudo make install |
| 74 | + |
| 75 | +{% endhighlight %} |
| 76 | + |
| 77 | +Step 6: **Install `leptonica` Package** : Leptonica is an image processing and analysis library and a core dependency of Tesseract OCR. Run the following commands in order, as a single block, starting from the `/home/ec2-user` directory. |
| 78 | + |
| 79 | +{% highlight c# tabtitle="C#" %} |
| 80 | + |
| 81 | +sudo yum groupinstall "Development Tools" -y |
| 82 | +sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel -y |
| 83 | +wget http://www.leptonica.org/source/leptonica-1.82.0.tar.gz |
| 84 | +tar -xzf leptonica-1.82.0.tar.gz |
| 85 | +cd leptonica-1.82.0 |
| 86 | +./configure |
| 87 | +make |
| 88 | +sudo make install |
| 89 | +sudo ldconfig |
| 90 | + |
| 91 | +{% endhighlight %} |
| 92 | + |
| 93 | +Step 7: **Install `libpng` Package** : `libpng` is the official PNG reference library and is needed to handle PNG images, which are often used in OCR processes. Although `libpng-devel` was installed in the earlier steps, building `libpng` 1.6.40 from source places a known version under `/usr/local/lib`, which Step 9 links to. |
| 94 | + |
| 95 | +{% highlight c# tabtitle="C#" %} |
| 96 | + |
| 97 | +sudo yum groupinstall "Development Tools" -y |
| 98 | +sudo yum install gcc make wget tar -y |
| 99 | + |
| 100 | +cd /tmp # Temporarily move to /tmp for build |
| 101 | +wget https://download.sourceforge.net/libpng/libpng-1.6.40.tar.gz |
| 102 | +tar -xzf libpng-1.6.40.tar.gz |
| 103 | +cd libpng-1.6.40 |
| 104 | +./configure |
| 105 | +make |
| 106 | +sudo make install |
| 107 | + |
| 108 | +{% endhighlight %} |
| 109 | + |
| 110 | +Step 8: **Create Symbolic Link for libdl** : The .NET runtime often expects `libdl.so` to be directly accessible from its shared library path. You need to create a symbolic link from its actual location to the .NET runtime directory. |
| 111 | + |
| 112 | +First, find the path of your installed .NET runtime version: |
| 113 | + |
| 114 | +{% highlight c# tabtitle="C#" %} |
| 115 | + |
| 116 | +dotnet --list-runtimes |
| 117 | + |
| 118 | +{% endhighlight %} |
| 119 | + |
| 120 | +The output will be similar to this (note the version number might differ slightly): |
| 121 | + |
| 122 | +{% highlight c# tabtitle="C#" %} |
| 123 | + |
| 124 | +Microsoft.AspNetCore.App 8.0.x [/usr/lib64/dotnet/shared/Microsoft.AspNetCore.App] |
| 125 | +Microsoft.NETCore.App 8.0.x [/usr/lib64/dotnet/shared/Microsoft.NETCore.App] |
| 126 | + |
| 127 | +{% endhighlight %} |
| 128 | + |
| 129 | +Now, create the symbolic link. Replace `8.0.17` with the exact version number shown for `Microsoft.NETCore.App` in your `dotnet --list-runtimes` output. |
| 130 | + |
| 131 | +{% highlight c# tabtitle="C#" %} |
| 132 | + |
| 133 | +sudo ln -s /usr/lib64/libdl.so.2 /usr/lib64/dotnet/shared/Microsoft.NETCore.App/8.0.17/libdl.so |
| 134 | + |
| 135 | +{% endhighlight %} |
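
The version segment of the link path does not have to be typed by hand. As a minimal sketch, the hypothetical helper `netcore_runtime_dir` (a name introduced here for illustration, not part of .NET or the Syncfusion sample) reads `dotnet --list-runtimes` output on standard input and prints the directory of the highest installed `Microsoft.NETCore.App` version. It assumes the output format shown above, a runtime path without spaces, and GNU `sort -V`.

```shell
# Hypothetical helper: reads `dotnet --list-runtimes` output on stdin and
# prints "<base path>/<version>" for the highest Microsoft.NETCore.App entry.
netcore_runtime_dir() {
  awk '$1 == "Microsoft.NETCore.App" {
         base = $3
         gsub(/^\[|\]$/, "", base)   # strip the surrounding [ and ]
         print base "/" $2           # e.g. .../Microsoft.NETCore.App/8.0.17
       }' | sort -V | tail -n 1      # version sort, keep the highest
}
```

With this helper defined, the link from this step could be created with `sudo ln -s /usr/lib64/libdl.so.2 "$(dotnet --list-runtimes | netcore_runtime_dir)/libdl.so"`.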
| 136 | + |
| 137 | +Step 9: **Create Symbolic Link for `libpng16`** |
| 138 | + |
| 139 | +Create a symbolic link for the `libpng16` package to ensure it's accessible in common library paths. |
| 140 | + |
| 141 | +{% highlight c# tabtitle="C#" %} |
| 142 | + |
| 143 | +sudo ln -s /usr/local/lib/libpng16.so.16 /lib64/libpng16.so.16 |
| 144 | + |
| 145 | +{% endhighlight %} |
| 146 | + |
| 147 | +Step 10: **Create Symbolic Link for `liblept`** |
| 148 | + |
| 149 | +Similarly, create a symbolic link for the `liblept` package (Leptonica library). |
| 150 | + |
| 151 | +{% highlight c# tabtitle="C#" %} |
| 152 | + |
| 153 | +sudo ln -s /usr/local/lib/liblept.so.5 /lib64/liblept.so.5 |
| 154 | + |
| 155 | +{% endhighlight %} |
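
Before moving on, it can help to confirm that each link resolves to a real file, since `ln -s` succeeds even when its target does not exist. The following is a small sketch; `check_libs` is a hypothetical function name introduced here, not a system tool. It relies on `[ -e ]` following symbolic links, so a dangling link is reported as missing.

```shell
# Hypothetical helper: prints "ok <path>" when <path> exists and resolves to
# a real file, or "missing <path>" when it is absent or a dangling symlink.
# Returns non-zero if any path is missing.
check_libs() {
  status=0
  for lib in "$@"; do
    if [ -e "$lib" ]; then      # -e follows symlinks to their target
      echo "ok $lib"
    else
      echo "missing $lib"
      status=1
    fi
  done
  return "$status"
}
```

For the links created in Steps 9 and 10, a typical call would be `check_libs /lib64/libpng16.so.16 /lib64/liblept.so.5`.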
| 156 | + |
| 157 | +Step 11: **Implement the Project Code** : To set up your project's OCR functionality, consult the comprehensive guide on [Perform OCR in Linux](https://help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/linux). |
| 158 | + |
| 159 | +Step 12: **Set Permissions for Tesseract Binaries** : Set read, write, and execute permissions on the Syncfusion Tesseract binaries (`libSyncfusionTesseract.so` and `liblept1753.so`) so that the OCR process can load them. Important: The commands in this step use `bin/Debug/net8.0/runtimes/linux/native` relative to the project directory; change it to the actual path where the binaries are located within your published application. |
| 160 | + |
| 161 | +{% highlight c# tabtitle="C#" %} |
| 162 | + |
| 163 | +sudo chmod 777 bin/Debug/net8.0/runtimes/linux/native/libSyncfusionTesseract.so |
| 164 | +sudo chmod 777 bin/Debug/net8.0/runtimes/linux/native/liblept1753.so |
| 165 | + |
| 166 | +{% endhighlight %} |
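
Because the location of the native binaries differs between debug and published builds, the permission change can be wrapped in a function that takes the directory as an argument. This is only a sketch: `set_native_permissions` is a hypothetical name, the default mode of 777 mirrors this guide, and the optional second argument is an assumption added so that a stricter mode can be tried.

```shell
# Hypothetical helper: applies a mode (default 777, as used in this guide) to
# the two Syncfusion Tesseract binaries inside the given directory and
# reports any binary that is not found there. Returns non-zero if one is missing.
set_native_permissions() {
  dir=$1
  mode=${2:-777}
  rc=0
  for lib in libSyncfusionTesseract.so liblept1753.so; do
    if [ -f "$dir/$lib" ]; then
      chmod "$mode" "$dir/$lib"
    else
      echo "not found: $dir/$lib" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A call such as `set_native_permissions bin/Debug/net8.0/runtimes/linux/native` would then cover both files. If the files are owned by another user, the `chmod` calls need elevated privileges, as in the commands above.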
| 167 | + |
| 168 | +Step 13: **Build and Run Your .NET Application** : Finally, build and publish your .NET application, and then run it. |
| 169 | + |
| 170 | +{% highlight c# tabtitle="C#" %} |
| 171 | + |
| 172 | +sudo dotnet build |
| 173 | + |
| 174 | +sudo dotnet publish -c Release -o ./publish |
| 175 | + |
| 176 | +cd publish |
| 177 | + |
| 178 | +sudo dotnet PdfProcessingApi.dll --urls "http://0.0.0.0:5000" |
| 179 | + |
| 180 | +{% endhighlight %} |
| 181 | + |
| 182 | +Remember to replace `PdfProcessingApi.dll` with the actual name of your application's entry-point DLL. |
| 183 | + |
| 184 | +Running the program produces the OCR-processed PDF document. The output is saved in the same directory as the `Program.cs` file. |
| 185 | + |
| 186 | + |
| 187 | +A complete working sample can be downloaded from [GitHub](https://github.com/SyncfusionExamples/OCR-csharp-examples/tree/master/Linux). |
| 188 | + |
| 189 | + |
| 190 | + |