ChatGPT - MyLearning On Dataset Aksara Jawa
Introduction
The purpose of this ChatGPT thread is to learn about “Dataset Aksara Jawa” (datasets of Javanese script).
Method of Learning
By asking questions.
Q1: Give examples of a simple dataset of Aksara Jawa that can
be used later to train a machine learning algorithm for
Javanese character recognition.
Here’s a simple dataset of Aksara Jawa (Javanese script) to help you get
started with training a machine learning algorithm for Javanese Character
Recognition. This dataset includes a mix of basic characters, diacritics, and
numbers.
Example Characters
Basic Javanese Characters
Unicode   Character   Name
U+A9B2    ꦲ           Ha
U+A98F    ꦏ           Ka
U+A992    ꦒ           Ga
U+A9A4    ꦤ           Na
U+A9A5    ꦥ           Pa
Diacritics (Sandhangan)
Unicode   Character   Name
U+A9B4    ꦴ           Tarung
U+A9B5    ꦵ           Tolong
U+A9B6    ꦶ           Wulu
U+A9B7    ꦷ           Wulu Melik
Numbers
Unicode   Character   Value
U+A9D0    ꧐           0
U+A9D1    ꧑           1
U+A9D2    ꧒           2
U+A9D3    ꧓           3
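Code points in tables like these are easy to mistype, so it can help to check them programmatically. Here is a minimal sketch using only Python's standard library; the `describe` helper is a hypothetical name introduced for illustration.

```python
import unicodedata

def describe(char: str) -> tuple[str, str]:
    """Return (code point as 'U+XXXX', official Unicode name) for one character."""
    return f"U+{ord(char):04X}", unicodedata.name(char)

# Print the code point and Unicode name of each character listed in the tables.
for ch in "ꦲꦏꦒꦤꦥꦴꦵꦶꦷ꧐꧑꧒꧓":
    code, name = describe(ch)
    print(code, name)
```

Comparing the printed names against the table is a quick way to catch a mislabeled character before it ends up in a training label.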
Example Training Dataset
Here’s a small sample with synthetic images created using Python’s PIL (Pillow)
library and the Noto Sans Javanese font.
Dataset Example with Labels:
ꦲ Ha
ꦏ Ka
ꦒ Ga
ꦤ Na
꧑ 1
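import os
from PIL import Image, ImageDraw, ImageFont

# Assumed setup (not shown in the original answer); adjust the paths as needed.
output_dir = "javanese_dataset"
font_path = "NotoSansJavanese-Regular.ttf"  # path to the Noto Sans Javanese font file
javanese_characters = [
    {"char": "ꦲ", "label": "Ha"},
    {"char": "ꦏ", "label": "Ka"},
    {"char": "ꦒ", "label": "Ga"},
    {"char": "ꦤ", "label": "Na"},
    {"char": "꧑", "label": "1"},
]

# Create one directory per character label
os.makedirs(output_dir, exist_ok=True)
for char_info in javanese_characters:
    os.makedirs(os.path.join(output_dir, char_info["label"]), exist_ok=True)

# Generate images
font = ImageFont.truetype(font_path, 28)  # load the font once
for char_info in javanese_characters:
    label = char_info["label"]
    for i in range(10):  # 10 images per character (identical until augmentation is added)
        img = Image.new("L", (32, 32), color=255)  # white grayscale canvas
        draw = ImageDraw.Draw(img)
        draw.text((4, 0), char_info["char"], font=font, fill=0)  # draw the character in black
        img.save(os.path.join(output_dir, label, f"{label}_{i}.png"))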
3. Diacritics Combination:
o Generate samples with base characters combined with diacritics
(e.g., ꦏꦶ).
2. Model Training:
o Use a Convolutional Neural Network (CNN) for character recognition.
3. Evaluation:
o Validate on unseen combinations or real-world samples.
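The diacritics-combination and evaluation ideas above can be sketched together: build every base-plus-sandhangan string, then hold some combinations out so the model is validated on pairs it never saw in training. This is only an illustration. The character lists, the function names, and the 20% hold-out fraction are assumptions, not part of the original answer.

```python
import itertools
import random

# Escapes are used so the code points are unambiguous.
BASES = ["\uA9B2", "\uA98F", "\uA992", "\uA9A4", "\uA9A5"]  # Ha, Ka, Ga, Na, Pa
DIACRITICS = ["\uA9B6", "\uA9B8", "\uA9BA", "\uA9BC"]       # wulu, suku, taling, pepet

def make_combinations(bases, diacritics):
    """Return every base+diacritic string; in Unicode text the mark follows its base."""
    return [b + d for b, d in itertools.product(bases, diacritics)]

def split_holdout(samples, holdout_fraction=0.2, seed=0):
    """Shuffle and split so held-out combinations never appear in the training part."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

combos = make_combinations(BASES, DIACRITICS)
train, heldout = split_holdout(combos)
print(len(combos), len(train), len(heldout))
```

Each string in `train` or `heldout` can then be rendered to an image the same way as a single character, since the font handles the placement of the diacritic.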
This simple dataset can serve as a starting point for training models to recognize
Javanese script. Let me know if you need further assistance with dataset
generation, augmentation, or model training!
Q2: Give examples of simple datasets of Aksara Jawa, in the
form of real images that I can download, that can be used later
to train a machine learning algorithm for Javanese character
recognition.
1. Hanacaraka Dataset
Description: This dataset comprises digital handwritten images of the 20
basic Javanese characters, commonly known as Hanacaraka. Each
character class contains approximately 75 images, with resolutions below
500x500 pixels.
Access: Available on GitHub.
Usage: Suitable for training models focused on recognizing individual
handwritten Javanese characters.
2. Baksara Dataset
Description: The Baksara dataset includes 164 images of Aksara Jawa
characters, annotated for object detection tasks. It also provides a
pre-trained model and an API for inference.
Access: Hosted on Roboflow Universe.
Usage: Ideal for developing object detection models that identify and
locate Javanese characters within images.
Description: Comprising 1,464 images, this dataset focuses on
handwritten Javanese characters, annotated for object detection. It
includes a pre-trained model and API for immediate testing.
Access: Accessible via Roboflow Universe.
Usage: Suitable for projects aiming to recognize and locate handwritten
Javanese characters in images.
Q3: What are the criteria for a good dataset for Javanese
character recognition?
A good dataset for Javanese character recognition must meet several key
criteria to ensure the effective training, validation, and testing of machine
learning models. These criteria encompass aspects like diversity, quality, labeling
accuracy, and scalability. Below are the essential criteria:
1. Completeness
All Relevant Characters:
o Include all basic characters (e.g., ꦲ to ꦔ in hanacaraka order), numbers
(꧐ to ꧙, U+A9D0 to U+A9D9), diacritics (sandhangan), and special symbols.
o Account for character combinations, such as base letters with
diacritics.
Balanced Representation:
o Ensure that all classes (characters) have a comparable number of
samples to prevent model bias.
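Balance is easy to measure once the images are on disk. The sketch below assumes a hypothetical layout with one subdirectory per class containing PNG files; the function names are illustrative.

```python
from collections import Counter
from pathlib import Path

def class_counts(root):
    """Count the PNG files in each class subdirectory of root."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.png"))
    return counts

def imbalance_ratio(counts):
    """Largest class size divided by the smallest; 1.0 means perfectly balanced."""
    sizes = list(counts.values())
    if not sizes or min(sizes) == 0:
        return float("inf")
    return max(sizes) / min(sizes)
```

For example, `imbalance_ratio(class_counts("javanese_dataset"))` returns 1.0 when every class has the same number of images. What ratio counts as too high is a judgment call that depends on the model and the task.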
2. Diversity
Styles:
o Include multiple styles:
3. Quality
Image Resolution:
o Ensure all images are of sufficient resolution (e.g., at least 28x28
pixels for simple models, 224x224 pixels for complex models).
Clarity:
o Avoid blurry or distorted images.
4. Labeling Accuracy
Correct Labels:
o Each image must be labeled accurately with the corresponding
Unicode or a descriptive label (e.g., ꦲ, ꦏ, ꦩ).
Consistent Labeling:
o Use a uniform format for labels (e.g., Unicode or text-based).
Diacritic Annotation:
o Clearly label combinations of base characters and diacritics.
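One way to keep labels uniform, including base-plus-diacritic combinations, is to derive them from code points rather than typing names by hand. This is a sketch of one possible convention. The `U+XXXX` tokens joined by underscores are an assumption, not an established standard.

```python
def codepoint_label(text):
    """Encode a character or base+diacritic sequence as 'U+XXXX' tokens joined by '_'."""
    return "_".join(f"U+{ord(ch):04X}" for ch in text)

def label_to_text(label):
    """Invert codepoint_label, recovering the original string."""
    return "".join(chr(int(token[2:], 16)) for token in label.split("_"))

print(codepoint_label("ꦏꦶ"))
```

Because the mapping is reversible, the same label can serve as a directory name, a CSV field, and a way to recover the exact character sequence later.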
5. Scalability
Size:
o Include a sufficient number of samples per character (e.g., hundreds
or thousands) to ensure model generalization.
Expandable:
o Design the dataset structure to allow easy addition of new samples
or classes.
7. Data Augmentation
Support augmentation techniques to increase diversity without manual
data collection:
o Rotation, scaling, and skewing.
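The three augmentations listed above can be sketched with Pillow, the library already used for image generation. The parameter ranges below are illustrative assumptions, and `augment` is a hypothetical helper name.

```python
import random
from PIL import Image

def augment(img, rng, max_angle=10.0, scale_range=(0.9, 1.1), max_shear=0.15):
    """Return a randomly rotated, scaled, and skewed copy of a grayscale image."""
    # Rotation about the image centre; exposed corners are filled with white.
    angle = rng.uniform(-max_angle, max_angle)
    out = img.rotate(angle, resample=Image.BILINEAR, fillcolor=255)

    # Scaling and horizontal skew as one affine transform about the centre.
    # Image.transform expects the inverse map (output pixel -> input pixel).
    s = rng.uniform(*scale_range)
    k = rng.uniform(-max_shear, max_shear)
    cx, cy = img.width / 2, img.height / 2
    a, b = 1 / s, -k / s
    d, e = 0.0, 1 / s
    c = cx - a * cx - b * cy
    f = cy - d * cx - e * cy
    return out.transform(img.size, Image.AFFINE, (a, b, c, d, e, f),
                         resample=Image.BILINEAR, fillcolor=255)
```

Passing a seeded `random.Random` instance as `rng` makes the augmented set reproducible. Keeping the ranges small matters for this script, since a large rotation or skew could make one character resemble another.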
9. Open Accessibility
Public Availability:
o If sharing the dataset, ensure it complies with copyright laws and
privacy concerns.
Documentation:
o Provide clear documentation explaining the dataset structure,
format, and labeling system.
Criteria                  Requirements
Ethical Considerations    Inclusive and respectful of cultural context.
Conclusion
A good dataset for Javanese character recognition must comprehensively cover
all characters, include diverse sources and styles, and maintain high-quality
images and accurate labels. These criteria ensure that the resulting machine
learning models perform effectively in real-world applications.
Would you like assistance in creating or finding such a dataset?