
ChatGPT - MyLearning On Dataset Aksara Jawa


Uploaded by

tbudiono
Dataset Aksara Jawa by ChatGPT

Introduction

The purpose of this ChatGPT thread is to learn about “Dataset Aksara Jawa” (datasets of Javanese script).

Method of Learning
By asking questions.

Q1: Give examples of a simple dataset of Aksara Jawa that can
be used later to train a machine learning algorithm for
Javanese character recognition.

Here’s a simple dataset of Aksara Jawa (Javanese script) to help you get
started with training a machine learning algorithm for Javanese Character
Recognition. This dataset includes a mix of basic characters, diacritics, and
numbers.

Simple Dataset Format


Structure:
• Images of individual characters saved in directories corresponding to their
  labels.
• Labels represented by Unicode or human-readable names.

Dataset Directory Example:
dataset/
    training/
        ꦲ/            # Ha
            img1.png
            img2.png
        ꦏ/            # Ka
            img1.png
            img2.png
    validation/
        ꦲ/            # Ha
            img1.png
            img2.png
        ꦏ/            # Ka
            img1.png
            img2.png
    labels.csv        # Optional

Example CSV File for Labels:


image_path,label
training/ꦲ/img1.png,ꦲ
training/ꦏ/img1.png,ꦏ
validation/ꦲ/img1.png,ꦲ
validation/ꦏ/img1.png,ꦏ
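
A labels file like the one above does not have to be written by hand. The following is a minimal sketch, assuming the dataset/<split>/<label>/<image> layout shown earlier; the function name build_labels_csv and the list of accepted file extensions are illustrative choices, not part of the original answer.

```python
import csv
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed image formats


def build_labels_csv(dataset_root, csv_name="labels.csv"):
    """Scan <root>/<split>/<label>/<image> and write image_path,label rows."""
    root = Path(dataset_root)
    rows = []
    for image in sorted(root.glob("*/*/*")):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            # The parent directory name is the class label (e.g. ꦲ or Ha).
            rows.append((image.relative_to(root).as_posix(), image.parent.name))
    with open(root / csv_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "label"])
        writer.writerows(rows)
    return rows
```

Calling build_labels_csv("dataset") would write dataset/labels.csv with paths relative to the dataset root, which matches the format of the example rows.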

Example Characters
Basic Javanese Characters

Unicode   Character   Name
U+A9B2    ꦲ           Ha
U+A98F    ꦏ           Ka
U+A992    ꦒ           Ga
U+A9A4    ꦤ           Na
U+A9A5    ꦥ           Pa

Diacritics (Sandhangan)

Unicode   Character   Name
U+A9B4    ꦴ           Tarung
U+A9B5    ꦵ           Tolong
U+A9B6    ꦶ           Wulu
U+A9B7    ꦷ           Wulu melik

Numbers

Unicode   Character   Name
U+A9D0    ꧐           0
U+A9D1    ꧑           1
U+A9D2    ꧒           2
U+A9D3    ꧓           3
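
Code points are easy to mistype, so one way to double-check a table like the ones above is to ask Python's standard unicodedata module. This is a small sketch; the helper name describe is hypothetical, and the character list is just a sample taken from this section.

```python
import unicodedata


def describe(char):
    """Return (code point, official Unicode name) for a single character."""
    return f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>")


# Sample characters taken from the tables in this section.
for ch in ["ꦲ", "ꦏ", "ꦒ", "ꦤ", "ꦥ", "ꦴ", "ꦶ", "ꦷ", "꧐", "꧑"]:
    code_point, name = describe(ch)
    print(code_point, name)
```

The official names are in English (for example, the letter Ha is listed as JAVANESE LETTER HA), so they can be compared directly against the Name column.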

Example Training Dataset
Here’s a small sample with synthetic images created using Python’s PIL (Pillow)
library and the Noto Sans Javanese font.
Dataset Example with Labels:

Image     Label (Unicode)   Label (Name)
[image]   ꦲ                 Ha
[image]   ꦏ                 Ka
[image]   ꦒ                 Ga
[image]   ꦤ                 Na
[image]   ꧑                 1

Python Code to Generate Dataset


from PIL import Image, ImageDraw, ImageFont
import os

# Characters and labels
javanese_characters = [
    {"char": "ꦲ", "label": "Ha"},
    {"char": "ꦏ", "label": "Ka"},
    {"char": "ꦒ", "label": "Ga"},
    {"char": "ꦤ", "label": "Na"},
    {"char": "ꦥ", "label": "Pa"},
]
output_dir = "dataset/training"
font_path = "NotoSansJavanese-Regular.ttf"  # the font file must be available locally

# Load the font once instead of once per image
font = ImageFont.truetype(font_path, 28)

# Create one directory per class label
for char_info in javanese_characters:
    char_dir = os.path.join(output_dir, char_info["label"])
    os.makedirs(char_dir, exist_ok=True)

# Generate images
for char_info in javanese_characters:
    char_dir = os.path.join(output_dir, char_info["label"])
    # 10 images per character; every copy is identical until augmentation is applied
    for i in range(10):
        img = Image.new("L", (32, 32), color=255)  # white grayscale canvas
        draw = ImageDraw.Draw(img)
        draw.text((4, 0), char_info["char"], font=font, fill=0)  # draw character in black
        img.save(os.path.join(char_dir, f"{char_info['label']}_{i}.png"))

Tips for Expanding Dataset


1. Handwritten Samples:
    o Collect handwritten samples using drawing tablets or scanning
      handwritten text.
2. Augmentation:
    o Rotate, scale, or distort images to simulate real-world conditions.
3. Diacritics Combination:
    o Generate samples with base characters combined with diacritics
      (e.g., ꦏꦶ).
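
For the third tip, the text of each combined sample can be produced by concatenating a base letter with a vowel sign, since in Unicode the sign follows the base consonant. The sketch below is an illustration only: the dictionaries BASES and SIGNS, the label format, and the function name combine are assumptions, and only two letters and two signs are listed.

```python
from itertools import product

BASES = {"Ka": "\uA98F", "Na": "\uA9A4"}      # base consonants
SIGNS = {"wulu": "\uA9B6", "suku": "\uA9B8"}  # vowel signs (sandhangan)


def combine(bases, signs):
    """Map a readable label such as 'Ka+wulu' to the combined text string."""
    return {
        f"{b_name}+{s_name}": b_char + s_char
        for (b_name, b_char), (s_name, s_char) in product(bases.items(), signs.items())
    }


combos = combine(BASES, SIGNS)
for label, text in combos.items():
    print(label, [f"U+{ord(c):04X}" for c in text])
```

Rendering these strings as images may need a text-shaping engine so that the sign attaches to the base letter correctly; with Pillow this would likely mean using its optional Raqm layout engine, which is worth verifying on the target system.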

How to Use the Dataset


1. Preprocessing:
    o Normalize images (resize, grayscale, or binary format).
2. Model Training:
    o Use a Convolutional Neural Network (CNN) for character recognition.
3. Evaluation:
    o Validate on unseen combinations or real-world samples.
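
The preprocessing step can be sketched with Pillow, the same library used for generating the images. The function name preprocess, the 32x32 target size, and the threshold of 128 are assumptions chosen for illustration; a real pipeline would tune them to the model's input.

```python
from PIL import Image


def preprocess(img, size=(32, 32), threshold=128):
    """Grayscale, resize, and binarize a PIL image; returns a mode 'L' image."""
    gray = img.convert("L")
    resized = gray.resize(size)
    # Pixels at or above the threshold become white (255), the rest black (0).
    return resized.point(lambda p: 255 if p >= threshold else 0)
```

For example, preprocess(Image.open("dataset/training/Ha/Ha_0.png")) would return a 32x32 black-and-white image, assuming that file exists.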

This simple dataset can serve as a starting point for training models to recognize
Javanese script. Let me know if you need further assistance with dataset
generation, augmentation, or model training!

Q2: Give examples of a simple dataset of Aksara Jawa in the
form of real images that I can download and that can be used later
to train a machine learning algorithm for Javanese character
recognition.

Using existing datasets can significantly speed up the development of a machine
learning model for Javanese character recognition. Here are some publicly
available datasets containing real images of Aksara Jawa:

1. Hanacaraka Dataset
• Description: This dataset comprises digital handwritten images of the 20
  basic Javanese characters, commonly known as Hanacaraka. Each
  character class contains approximately 75 images, with resolutions below
  500x500 pixels.
• Access: Available on GitHub.
• Usage: Suitable for training models focused on recognizing individual
  handwritten Javanese characters.

2. Baksara Dataset
• Description: The Baksara dataset includes 164 images of Aksara Jawa
  characters, annotated for object detection tasks. It also provides a
  pre-trained model and an API for inference.
• Access: Hosted on Roboflow Universe.
• Usage: Ideal for developing object detection models that identify and
  locate Javanese characters within images.

3. Aksara Jawa Object Detection Dataset


• Description: This dataset contains 85 images of Hanacaraka characters,
  annotated for object detection purposes. It supports various formats
  compatible with popular machine learning frameworks.
• Access: Available on Roboflow Universe.
• Usage: Beneficial for training models that detect and classify Javanese
  script characters in diverse contexts.

4. Javanese Handwriting Object Detection Dataset

• Description: Comprising 1,464 images, this dataset focuses on
  handwritten Javanese characters, annotated for object detection. It
  includes a pre-trained model and API for immediate testing.
• Access: Accessible via Roboflow Universe.
• Usage: Suitable for projects aiming to recognize and locate handwritten
  Javanese characters in images.

5. Aksara Jawa Custom Dataset


• Description: This dataset offers 2,154 training images and 480
  evaluation images of handwritten Javanese characters. Each image is in
  RGB format with a resolution of 224x224 pixels.
• Access: Referenced in academic publications; details can be found in
  related research articles.
• Usage: Appropriate for training and evaluating models on standardized
  handwritten Javanese character images.

6. Hanacaraka AI Project Dataset


• Description: Utilized in the Hanacaraka AI project, this dataset contains
  images of handwritten Javanese characters. The project includes models
  trained on this dataset, achieving up to 96% validation accuracy.
• Access: Details and access provided through the project's GitHub
  repository.
• Usage: Useful for understanding model training processes and
  performance benchmarks on Javanese character recognition tasks.

Recommendations for Use:


• Data Augmentation: Enhance model robustness by applying
  transformations such as rotation, scaling, and noise addition to the
  images.
• Preprocessing: Standardize image sizes and formats to align with the
  input requirements of your machine learning model.
• Evaluation: Utilize the provided evaluation datasets to assess model
  performance and ensure generalization to unseen data.
By leveraging these datasets, you can effectively train and evaluate machine
learning models for Javanese Character Recognition, facilitating advancements in
the digitization and preservation of Javanese script.

Q3: What are the criteria for a good dataset for Javanese
character recognition?

A good dataset for Javanese character recognition must meet several key
criteria to ensure the effective training, validation, and testing of machine
learning models. These criteria encompass aspects like diversity, quality, labeling
accuracy, and scalability. Below are the essential criteria:

1. Completeness
• All Relevant Characters:
    o Include all basic characters (e.g., ꦏ to ꦲ, U+A98F to U+A9B2), numbers
      (꧐ to ꧙, U+A9D0 to U+A9D9), diacritics (sandhangan), and special symbols.
    o Account for character combinations, such as base letters with
      diacritics.
• Balanced Representation:
    o Ensure that all classes (characters) have a comparable number of
      samples to prevent model bias.
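
Balanced representation is straightforward to check once the labels are available as a list, for instance the label column of a labels.csv file. The following sketch assumes that input; the function name class_balance and the idea of reporting a largest-to-smallest ratio are illustrative, not a standard metric from the original answer.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts and the ratio of largest to smallest class."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


counts, ratio = class_balance(["Ha", "Ha", "Ha", "Ka", "Ga", "Ga"])
print(dict(counts), ratio)
```

A ratio close to 1 suggests the classes are comparably represented; a large ratio points to classes that need more samples or augmentation.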

2. Diversity
• Styles:
    o Include multiple styles:
        • Printed characters (using fonts like Noto Sans Javanese).
        • Handwritten characters from various individuals to account
          for variability in handwriting.
• Sources:
    o Use diverse sources such as scanned documents, handwritten
      notes, and digital text.
• Variability:
    o Include variations in:
        • Font sizes and styles.
        • Lighting conditions.
        • Background textures.

3. Quality

• Image Resolution:
    o Ensure all images are of sufficient resolution (e.g., at least 28x28
      pixels for simple models, 224x224 pixels for complex models).
• Clarity:
    o Avoid blurry or distorted images.
    o Use preprocessing (e.g., noise reduction, contrast adjustment) to
      improve image quality.
• Uniform Dimensions:
    o Standardize image dimensions and formats to simplify model
      training.

4. Labeling Accuracy
• Correct Labels:
    o Each image must be labeled accurately with the corresponding
      Unicode character or a descriptive label (e.g., ꦲ, ꦏ, ꦩ).
• Consistent Labeling:
    o Use a uniform format for labels (e.g., Unicode or text-based).
• Diacritic Annotation:
    o Clearly label combinations of base characters and diacritics.

5. Scalability
• Size:
    o Include a sufficient number of samples per character (e.g., hundreds
      or thousands) to ensure model generalization.
• Expandable:
    o Design the dataset structure to allow easy addition of new samples
      or classes.

6. Noise and Disturbances


• Include realistic challenges such as:
    o Handwritten samples with varying levels of neatness.
    o Printed text with distortions, smudges, or noise (e.g., low-quality
      scans).
    o Images with overlapping or connected characters.

7. Data Augmentation
• Support augmentation techniques to increase diversity without manual
  data collection:
    o Rotation, scaling, and skewing.
    o Brightness and contrast adjustments.
    o Adding noise or simulating handwritten imperfections.
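
A few of these augmentations can be sketched with Pillow. The function name augment and all parameter ranges below are assumptions for illustration; the sketch covers rotation, scaling, and brightness only, and it shrinks rather than enlarges so the character stays inside the canvas.

```python
import random
from PIL import Image, ImageEnhance


def augment(img, rng, max_angle=10, scale_range=(0.8, 1.0), brightness_range=(0.8, 1.2)):
    """Return a randomly rotated, shrunk, brightness-shifted copy (mode 'L')."""
    img = img.convert("L")
    width, height = img.size
    # Rotate around the centre; uncovered corners are filled with white.
    out = img.rotate(rng.uniform(-max_angle, max_angle), fillcolor=255)
    # Shrink, then paste centred onto a white canvas of the original size.
    scale = rng.uniform(*scale_range)
    new_w, new_h = max(1, round(width * scale)), max(1, round(height * scale))
    scaled = out.resize((new_w, new_h))
    canvas = Image.new("L", (width, height), color=255)
    canvas.paste(scaled, ((width - new_w) // 2, (height - new_h) // 2))
    # A brightness factor of 1.0 leaves the image unchanged.
    return ImageEnhance.Brightness(canvas).enhance(rng.uniform(*brightness_range))
```

Passing a seeded generator, for example augment(img, random.Random(0)), keeps the augmented set reproducible between runs.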

8. Balanced Dataset Split


• Training, Validation, and Test Sets:
    o Split the dataset appropriately (e.g., 70% training, 15% validation,
      15% testing).
• Avoid Data Leakage:
    o Ensure there is no overlap between training, validation, and test
      sets.
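
One way to get such a split is to shuffle and divide each class separately, so every class appears in all three sets and no file is assigned twice. This is a minimal sketch assuming the samples are (path, label) pairs; the function name stratified_split and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_split(samples, train=0.7, val=0.15, seed=42):
    """Split (path, label) pairs per class into disjoint train/val/test lists."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, paths in sorted(by_label.items()):
        paths = sorted(set(paths))  # drop duplicate paths so no file lands in two splits
        rng.shuffle(paths)
        n_train = int(len(paths) * train)
        n_val = int(len(paths) * val)
        splits["train"] += [(p, label) for p in paths[:n_train]]
        splits["val"] += [(p, label) for p in paths[n_train:n_train + n_val]]
        splits["test"] += [(p, label) for p in paths[n_train + n_val:]]
    return splits
```

This only prevents the same file path from appearing in two sets. Near-duplicates, such as augmented copies of one image or many samples from one writer, would still need to be grouped into the same set by hand.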

9. Open Accessibility
• Public Availability:
    o If sharing the dataset, ensure it complies with copyright laws and
      privacy concerns.
• Documentation:
    o Provide clear documentation explaining the dataset structure,
      format, and labeling system.

10. Ethical Considerations


• Representation:
    o Include characters written by individuals from different
      demographics to avoid biases.
• Cultural Respect:
    o Properly acknowledge the cultural significance of the Javanese
      script.

Checklist for a Good Dataset

Criteria                 Requirements
Completeness             All characters, numbers, diacritics, and combinations included.
Diversity                Variations in style, source, and appearance.
Quality                  Clear, high-resolution images with uniform dimensions.
Labeling Accuracy        Accurate, consistent labels for each character.
Scalability              Large, expandable dataset with balanced classes.
Noise and Disturbances   Realistic challenges like handwriting variability or noise.
Data Augmentation        Support transformations for diversity.
Balanced Split           Proper distribution into training, validation, and test sets.
Accessibility            Publicly available with clear documentation.
Ethical Considerations   Inclusive and respectful of cultural context.

Conclusion
A good dataset for Javanese character recognition must comprehensively cover
all characters, include diverse sources and styles, and maintain high-quality
images and accurate labels. These criteria ensure that the resulting machine
learning models perform effectively in real-world applications.
Would you like assistance in creating or finding such a dataset?
