By fridahoeft

How to create good training data for working with AI tools

The importance of accurate, diverse training data was touched on in the previous blog entry and is discussed in more detail here.

[Image: a small extract from my created training data]

Marcus du Sautoy, Professor of Mathematics at Oxford University, refers to data as a digital oilfield, emphasizing how valuable data has become in our time. Data also underpins the ongoing AI revolution: "Big Data" is a basic requirement for AI tools. The amount of data we humans produce every day is constantly increasing, and we need algorithms to evaluate these data sets and draw meaningful conclusions from them. In doing so, algorithms do not "understand" the correlations in the data; they calculate probabilities. Humans can synthesize experience, but we can only process a tiny fraction of this data ourselves and need algorithms to compensate.


Labeled data

In supervised learning, algorithms need labeled data sets so that they can identify relationships (underlying patterns) between individual data points. The data must therefore be classified manually by humans, who tag images with the associated labels, for example. This process is usually the most time-consuming part of working with AI tools. Humans thus train models in machine vision and, in doing so, also pass their own individual way of looking at the data on to the models. In unsupervised learning, by contrast, models learn from raw, unlabeled data.
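The distinction between labeled and raw data can be sketched in a few lines. This is a minimal illustration with hypothetical file names and label categories, not the actual dataset from this project:

```python
# A minimal sketch of labeled vs. unlabeled data (hypothetical file names).
# In supervised learning, each sample carries a human-assigned label;
# in unsupervised learning, the model receives only the raw samples.

labeled_data = [
    ("glyph_0001.png", "serif"),        # label assigned by a human annotator
    ("glyph_0002.png", "sans-serif"),
    ("glyph_0003.png", "handwriting"),
]

# Stripping the labels leaves the raw data an unsupervised model would see.
unlabeled_data = [filename for filename, _ in labeled_data]

# A supervised model trains on (sample, label) pairs ...
for sample, label in labeled_data:
    pass  # e.g. model.fit(sample, label)

# ... while an unsupervised model sees only the samples themselves.
for sample in unlabeled_data:
    pass  # e.g. model.fit(sample)
```

The expensive part is the first list: every pair in it represents a manual tagging decision by a human.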


Problematic training data

A weakness of machine learning is the misinterpretation of correlations as causalities; this can only be mitigated with a larger and more diverse data set. Algorithms also tend to ignore what is unusual or irregular in the data and only pay attention to it when it occurs repeatedly. Working with data raises ethical and legal issues as well, which are only briefly touched upon in this blog entry. Data bias (e.g., hardly any images of people of color in facial recognition software) discriminates against people, and the sheer amount of data needed repeatedly leads to data theft and violations of the right to one's own image (as with the Lensa app).


How I created my datasets

The most important step in training AI models yourself is assembling the training dataset. In my case, this means making sure that all images are stored in the same file format (PNG) and at the same resolution (512x512). In my training data, the letter is always black on a white background, so that I can later continue working with the generated letters. The position of the letter on the canvas also matters: ideally, every letter sits on the same baseline.
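These consistency checks can also be automated. Below is a sketch of how one might normalize a folder of source images to the format described above, assuming the Pillow library is installed; the "raw" and "normalized" folder names are hypothetical, not part of my actual workflow:

```python
# A sketch of normalizing a training set to uniform PNG / 512x512 / grayscale,
# assuming Pillow is installed. Folder names here are hypothetical examples.
from pathlib import Path

from PIL import Image

TARGET_SIZE = (512, 512)

def normalize(src: Path, dst_dir: Path) -> Path:
    """Convert one image to a 512x512 grayscale PNG (black glyph, white ground)."""
    img = Image.open(src).convert("L")      # grayscale: black letter on white
    img = img.resize(TARGET_SIZE)           # enforce a uniform resolution
    out = dst_dir / (src.stem + ".png")     # enforce a uniform file format
    img.save(out, format="PNG")
    return out

raw_dir = Path("raw")          # hypothetical folder of source images
dst_dir = Path("normalized")   # hypothetical output folder
if raw_dir.is_dir():
    dst_dir.mkdir(exist_ok=True)
    for src in raw_dir.glob("*"):
        normalize(src, dst_dir)
```

Running something like this once over the whole dataset catches stray resolutions or file formats before they reach the model.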


I downloaded and installed a few thousand fonts (serif, sans serif, and handwriting) from websites like Google Fonts and 1001Fonts on my computer. Then, in InDesign, I placed the letter A in a different font and style on each page. With the three font categories (serif, sans serif, and handwriting), the dataset is hopefully diverse enough to give the AI model enough leeway when generating.
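A scripted alternative to laying out pages in InDesign would be to render the glyphs directly. The sketch below assumes Pillow and a hypothetical "fonts" folder of .ttf files; it draws a black letter A on a white 512x512 canvas, pinning every glyph to the same baseline as described above:

```python
# A scripted alternative to the InDesign workflow (my assumption, not the
# author's actual method): render "A" from each font file in a hypothetical
# "fonts/" folder to a 512x512 black-on-white PNG. Assumes Pillow is installed.
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont

CANVAS = (512, 512)
BASELINE_Y = 400  # fixed baseline so every glyph sits at the same height

def render_glyph(font_path: Path, out_dir: Path, char: str = "A") -> Path:
    img = Image.new("L", CANVAS, color=255)     # white canvas
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(str(font_path), size=320)
    # anchor="ms": horizontally centered, vertically on the baseline.
    draw.text((CANVAS[0] // 2, BASELINE_Y), char, fill=0, font=font, anchor="ms")
    out = out_dir / (font_path.stem + ".png")
    img.save(out, format="PNG")
    return out

font_dir = Path("fonts")    # hypothetical folder of downloaded font files
out_dir = Path("dataset")   # hypothetical output folder
if font_dir.is_dir():
    out_dir.mkdir(exist_ok=True)
    for f in sorted(font_dir.glob("*.ttf")):
        render_glyph(f, out_dir)
```

One letter per font file, all in the same position and resolution, gives the kind of uniform dataset the previous section calls for.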


