Deep Learning-Based End-to-End Language Development Screening for Children Using Linguistic Knowledge

Oh, Byoung-Doo; Lee, Yoon-Kyoung; Kim, Jong-Dae; Park, Chan-Young; Kim, Yu-Seop

doi:10.3390/app12094651

¹ Department of Convergence Software, Hallym University, Chuncheon 24252, Gangwon-do, Korea

² Bio-IT Center, Hallym University, Chuncheon 24252, Gangwon-do, Korea

³ Division of Speech Pathology and Audiology, Hallym University, Chuncheon 24252, Gangwon-do, Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(9), 4651; https://0-doi-org.brum.beds.ac.uk/10.3390/app12094651

Submission received: 10 April 2022 / Revised: 2 May 2022 / Accepted: 4 May 2022 / Published: 6 May 2022

(This article belongs to the Special Issue Selected Papers from IMETI 2021)

Abstract

Language development is inextricably linked to the development of fundamental human abilities. Abnormal language development in childhood can lead to language impairment, which severely affects other aspects of life. Early treatment of language impairments in children is therefore critical. However, because it is difficult for parents to identify atypical language development in their children, the optimal diagnosis and treatment periods are frequently missed. Furthermore, the diagnosis process requires a significant amount of time and work. In this study, we therefore present a deep learning-based language development screening model based on word and part-of-speech features and investigate the effectiveness of large-scale language models. For the experiment, we collected data from Korean children by transcribing the utterances of children aged 2, 4, and 6 years. Convolutional neural networks and the notion of Siamese networks, together with word and part-of-speech information, were used to determine the language development level of children. We also investigated the effectiveness of KoBERT and KR-BERT, two Korean-specific large-scale language models. In a 5-fold cross-validation study, the proposed model achieved an average accuracy of 78.0%. Furthermore, contrary to expectations, the large-scale language models were shown to be ineffective for representing children’s utterances.

Keywords:

language development screening for children; linguistic knowledge; Siamese networks; convolutional neural networks; large-scale language model

1. Introduction

Language is an innate human skill that develops through experiences and interactions with the environment, as well as through genetic factors. As a result, language development is intimately linked to the development of cognitive ability, intelligence, and social skills [1]. Improper language development in childhood can therefore result in language impairments, which can severely affect other aspects of life, such as academic achievement, economic potential, and social relationships [2,3,4,5]. If a child’s language development is delayed, or if such a delay is found to be impacting his or her learning abilities, speech therapy should be started as soon as feasible, because treatment works best if it begins before the child’s language and comprehension have developed further. If the problem is not addressed early, the child may become a slow learner with poor learning abilities. This has a negative influence on school life, and researchers are working on the early detection of slow learners [6,7]. It is therefore critical to recognize and diagnose a child’s language development level or language problem as early as possible.

In the English-speaking world, Language Sample Analysis (LSA) [8] has long been used to address this problem by utilizing linguistic information such as morphology and syntax. LSA requires transcribed data of children’s utterances, and speech-language pathologists evaluate the child’s language development level based on the LSA results. However, this analytical procedure takes a significant amount of time, effort, and skill. In response, LSA software has been developed and is now widely deployed. Because English is nearly the only language supported by this software, it cannot be fully utilized for other languages. Furthermore, because it takes a long time for speech-language pathologists to learn how to use the software and interpret its analytic findings, it is difficult to use effectively [9]. The main issue, however, is that parents frequently miss the right time for diagnosis and treatment, since it is difficult for them to evaluate a child’s language development level [10].

In this study, we use deep learning to objectively and readily identify children’s language development levels and investigate the feasibility of employing a large-scale language model. We used words and parts of speech as features, extracted from data transcribed from Korean children’s utterances. As a person grows up, so does his or her linguistic competence, which eventually leads to the use of grammatically complete phrases. For example, a six-year-old child speaks in more grammatically natural phrases than a two-year-old. For this task, linguistic knowledge such as vocabulary and parts of speech needs to be analyzed comprehensively rather than separately. We therefore consider that learning vocabulary and part-of-speech features simultaneously is more effective than the usual method of learning them separately; in other words, it is important to look at lexical and grammatical patterns at the same time. To this end, we utilized Siamese networks [11] and convolutional neural networks (CNNs) [12]. CNNs not only perform well in computer vision [13,14,15] but have also recently performed well in natural language processing (NLP). A Siamese network learns by sharing the weights of a neural network to calculate the similarity of two inputs. We therefore created a language development screening model with a CNN based on the notion of Siamese networks to simultaneously learn rules in words and parts of speech. Moreover, large-scale language models such as KoBERT (https://github.com/SKTBrain/KoBERT (accessed on 24 September 2021)) and KR-BERT [16] were compared with models that do not use a large-scale language model for word representation. For part-of-speech representation, we created a model pretrained with Word2Vec [17] and compared it with a model that did not use pretraining.

This paper is organized as follows. Section 2 discusses related work. Section 3 describes the type and properties of the transcription data we used. Section 4 provides in-depth information about our proposed model, and Section 5 presents the experimental results and their discussion. Finally, Section 6 summarizes our findings and draws conclusions from them.

2. Related Work

2.1. Language Sample Analysis

Language sample analysis (LSA) consists of three steps: collection, transcription, and analysis. Counseling professionals use a standardized utterance collection procedure to collect children’s utterances and then produce transcription data by transcribing the collected utterances into text. On the transcription data, the expert examines numerous measurement indicators [18,19] based on linguistic knowledge such as syntax, semantics, and pragmatics. The results are compared with statistical values (analysis criteria) that reflect the general degree of language development seen in the same age group. Because LSA compares the child’s utterances with the usage patterns of linguistic information appearing in the sample group, the child’s language development level can be reliably confirmed, and this approach is used in many languages. However, LSA has several limitations [20]: (1) Training a speech-language pathologist who can effectively evaluate speech involves a significant amount of time and work. (2) A significant amount of time is necessary, since the speech-language pathologist performs all operations, such as collection, transcription, and analysis, directly. (3) Maintaining consistency of analysis is challenging, since the findings may differ depending on the expert who participated in the diagnosis or the method employed.

Various tools, such as Systematic Analysis of Language Transcripts (SALT) [21] and Computerized Language Analysis (CLAN) [22], have been created and used in English-speaking countries to address these limitations of LSA. They continually accumulate transcription data and update their indicators based on these data to make the analysis criteria more general.

2.2. Automatic Screening Tools for Language and Speech Impairment

Continual efforts are currently being made in the English-speaking world to create technologies that automatically screen children for language and speech impairments [23,24]. These studies performed the evaluation using only phonetics-based methods. They extracted information such as the pitch and pauses of speech from speech signals, used it as features, and classified children’s speech and language impairments using machine learning techniques. Existing researchers prefer speech signals because they save time by skipping the transcription step and make data collection easier. Speech signals, in particular, are highly effective in discriminating these disorders because they exhibit several characteristics associated with the speech organs, such as the breathing, vocalization, and articulation processes.

In a prior study, we used a language processing model based on deep learning and simple statistics to classify five age groups (pre-school children, elementary school, middle/high school, adult, and senior citizen) [25]. However, it is necessary to focus on children, since difficulties with language development that begin in childhood have lifelong effects. The focus of this study is therefore on children, and we explore how to apply language processing models effectively to this challenge.

3. Transcription Data

The expert (counselor) provides counseling to the child (counselee), records the conversation, and generates transcription data by capturing the recorded data as text in line with the protocol. All utterances made by the expert and the child during the counseling procedure are included in the transcription data. Children’s utterances should be gathered in the same setting and conditions as much as feasible in order to evaluate the level of language development with high reliability. As a consequence, the expert adheres to a standardized utterance collecting procedure (conversation protocol). The conversation protocol’s primary function is to keep the expert from interfering with the flow of the discussion or the content of the children’s replies as much as possible. Only when professional interaction is minimal can a child’s pure level of language development be tested. Furthermore, properly delineating the boundaries between each speech uttered in counseling aids in determining the child’s language development level [26,27].

Transcription data gathered by the Hallym Conversation Protocol [28] created by the Division of Speech Pathology and Audiology in Hallym University was used in this study. When collecting transcription data, the expert’s participation is limited to asking questions to elicit replies from the child and passively reacting to the child’s responses. Transcription data from children ages 2, 4, and 6 (which implies a 2-year gap) were employed in this study to discern developmental changes. Each child’s transcription data comprised an average of 151 utterances and were gathered for a total of 111 children. Table 1 shows the simple statistics of transcription data obtained by the age group, and Table 2 shows transcription data samples.

4. End-to-End Language Development Screening Model

The architecture of the language development screening model proposed in this study is shown in Figure 1. First, we extract the children’s utterances from the transcription data and remove the noise. The word feature is generated via tokenization, and the part-of-speech feature is generated by performing part-of-speech analysis and replacing each word with its part-of-speech. These two features are trained simultaneously on the Siamese network-based convolutional layer, and max-pooling and average-pooling are applied to the extracted feature maps. Following that, the max-pooling and average-pooling outputs of each input are concatenated and passed to the first dense layer, and the two outputs of the first dense layer are concatenated and passed to the second dense layer. To predict the age group of children, the output of the second dense layer is passed through a softmax function in the output layer. The rest of this section examines each component in detail.

Furthermore, we investigate how effective a large-scale language model that has lately achieved good performance in NLP tasks is in this challenge. We employed the following Korean-specific BERT models for word features: KoBERT and KR-BERT. We constructed a part-of-speech language model using the spoken language corpus from the National Institute of Korean Language (https://corpus.korean.go.kr (accessed on 25 August 2020)) and Word2Vec.

4.1. Preprocessing

LSA removes items that are not suited for screening the level of language development, such as maze words, silent pauses, and words that indicate inaccurate pronunciation. However, because the language system is not yet complete when children are very young, we thought that these features might be useful in validating the level of language development depending on age. Consequently, the maze words were deleted, while silent pauses and inaccurate pronunciation were handled by representing them with specific symbols. The specific symbols selected in this manner are shown in Table 3.

Languages generate sentences with set rules (or patterns) such as grammars. While these rules have a significant impact on written languages, spoken languages have a considerably more open and flexible structure than written languages. Therefore, we hypothesize that the rules that appear in a language may differ depending on the level of language development. We utilized part-of-speech to verify this characteristic. Table 4 shows the Korean part-of-speech system employed in the study, and Figure 2 compares the Korean and English part-of-speech systems.

4.2. Tokenization for Korean

By processing words at the character level, subword tokenization can solve the Out-of-Vocabulary (OOV) problem. For example, an abbreviation such as “OMG (Oh my god)”, which is not in the vocabulary, is tokenized into subwords such as “O”, “M”, and “G”. Handled as subwords in this manner, the word vector for “OMG” can be represented by a combination of the individual letters (“O”, “M”, and “G”) in the vocabulary. This method also has the benefit of allowing every letter of the alphabet to be included in the vocabulary at a low cost. However, because Korean is an agglutinative language whose words are written by combining syllables made up of consonants and vowels, Korean-specific subword tokenization approaches have been investigated [29,30,31].
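The OOV fallback described above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used in the study: the greedy longest-match strategy, the function name `greedy_subword_tokenize`, and the toy vocabulary are all assumptions made for the example.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Not even a single character matched: the word is unknown.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration): every capital letter
# plus two whole words.
toy_vocab = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | {"OK", "LOL"}

omg_pieces = greedy_subword_tokenize("OMG", toy_vocab)
lol_pieces = greedy_subword_tokenize("LOL", toy_vocab)
print(omg_pieces)  # ['O', 'M', 'G']
print(lol_pieces)  # ['LOL']
```

Because every single letter is in the vocabulary, any word written with those letters can be represented, which is the low-cost coverage the paragraph refers to.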

In addition, the Korean language forms noun phrases and verb phrases differently. In noun phrases, the postposition is attached after the stem, while in verb phrases, the ending is attached after the stem of the verb. By applying Byte-Pair-Encoding (BPE) [32] in both directions, the BidirectionalWordPiece Tokenizer [16] proposed with KR-BERT can handle this difference. Therefore, the BidirectionalWordPiece Tokenizer was employed in this study.

4.3. Large-Scale Language Model: BERT

BERT (Bidirectional Encoder Representations from Transformers) [33] is a pre-trained language model in which the encoder of the Transformer [34] is stacked in many layers with a bidirectional structure, and it has demonstrated state-of-the-art performance in various NLP applications. BERT, like other neural language models, generates embeddings that reflect context, but it has a few notable characteristics. (1) It solves the OOV problem using the WordPiece Tokenizer. (2) It adds the symbols “[CLS]” and “[SEP]” to the start and end of a sentence, respectively. Moreover, the segment feature is used in conjunction with these symbols so that the information of each sentence in the context can be trained. (3) It uses positional encoding to represent the position at which each token appears in the context. The information about each token is then combined to generate the BERT input, as shown in Figure 3.
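As a rough illustration of how these three kinds of information line up per token, the following Python sketch assembles the token, segment, and position sequences for one or two sentences. The function name `build_bert_inputs` and the example tokens are hypothetical; a real BERT implementation would additionally map each sequence to embeddings and sum them.

```python
def build_bert_inputs(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids, position_ids) for one or two sentences."""
    # "[CLS]" opens the input and "[SEP]" closes the first sentence.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        # The second sentence gets segment id 1 and its own closing "[SEP]".
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    # One position index per token, starting at 0.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_inputs(
    ["I", "like", "dogs"], ["me", "too"]
)
print(tokens)        # ['[CLS]', 'I', 'like', 'dogs', '[SEP]', 'me', 'too', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7]
```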

We compare KoBERT and KR-BERT in this study. KoBERT is a Korean-specific BERT model released by SKT Brain (https://github.com/SKTBrain/KoBERT (accessed on 24 September 2021)) and was trained with Korean Wikipedia containing 25 M sentences (324 M words). KR-BERT is a Korean-specific BERT model released by Seoul National University (https://github.com/snunlp/KR-BERT (accessed on 11 January 2021)) [16], and it was trained with a self-generated corpus containing 20 M sentences (233 M words).

4.4. Part-of-Speech Embeddings

In NLP tasks, text is split into tokens according to the token unit of each language (for example, the token unit of English is a word, and the token unit of Korean is a morpheme), and each token is represented as a vector. This is known as word embedding, and because it is a distributed representation, it can effectively express the similarity between words. When screening the level of language development, however, utterances made in the same environment and conditions are analyzed. Consequently, transcription data are gathered from conversations about the same topic. In this case, information such as the child’s lexical diversity can be verified, but numerous similar words can appear regardless of the child’s level of language development.

The rules in language, on the other hand, can be compared across age groups, and a substantial difference between normal children and children with language impairments can be recognized; hence, they are used as a measurement indicator in LSA. We produce part-of-speech embeddings using Word2Vec to take advantage of the rules in the parts of speech that make up sentences. Word2Vec trains a neural network on words represented by one-hot encoding, which is a sparse representation, and uses the weights of each word calculated in the hidden layer as its word vector. The training methods of Word2Vec are CBOW (Continuous Bag-Of-Words) and Skip-gram, as shown in Figure 4.

We used a spoken language dataset (https://corpus.korean.go.kr/main.do (accessed on 25 August 2020)) of 233 M morphemes provided by the National Institute of Korean Language to generate the pre-trained part-of-speech embeddings. We performed part-of-speech analysis on this corpus, replaced every word with its part-of-speech, and trained on the resulting sequences. We used Word2Vec’s Skip-gram model.
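To make the Skip-gram setup concrete, the sketch below generates the (center, context) training pairs that a Skip-gram model would see for a single part-of-speech sequence. The tag names (NNG, JKS, VV, EF), the window size, and the function name `skipgram_pairs` are assumptions chosen for illustration; they are not taken from the study's configuration.

```python
def skipgram_pairs(sequence, window=2):
    """Return every (center, context) pair within `window` positions."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical part-of-speech sequence for one utterance
# (noun, subject particle, verb, sentence-final ending).
pos_sequence = ["NNG", "JKS", "VV", "EF"]
pairs = skipgram_pairs(pos_sequence, window=1)
print(pairs)
# [('NNG', 'JKS'), ('JKS', 'NNG'), ('JKS', 'VV'),
#  ('VV', 'JKS'), ('VV', 'EF'), ('EF', 'VV')]
```

Skip-gram then trains the network to predict each context tag from its center tag, so tags that occur in similar grammatical positions end up with similar vectors.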

4.5. Convolution Neural Networks with Siamese Network

In NLP, the input of the CNN is an embedding matrix $x_{1:L} \in \mathbb{R}^{L \times k}$ composed of $L$ word vectors $x_{i} \in \mathbb{R}^{k}$ of dimension $k$. In the convolution layer, the convolution operation is carried out using a filter with weights $w \in \mathbb{R}^{n \times k}$. Similar to an n-gram, the filter computes a new feature $c_{i}$ from a window of $n$ consecutive $k$-dimensional word vectors, as shown in Equation (1):

$$c_{i} = f(w \cdot x_{i:i+n-1} + b), \tag{1}$$

$$c = [c_{1}, c_{2}, c_{3}, \dots, c_{L-n+1}], \tag{2}$$

where $f$ is an activation function and $b$ is a bias value; the equations assume a convolution filter stride of 1. The convolution filter is applied to every possible window of word vectors, and the resulting features are collected into a feature map $c$, as shown in Equation (2). Pooling operations such as max-pooling and average-pooling are then applied to each area of the feature map $c$. Max-pooling extracts the maximum value of the area, and average-pooling extracts the average of the area.
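The convolution and pooling steps can be traced with a small pure-Python sketch. It assumes a stride of 1, so $L$ word vectors and a window of $n$ yield $L - n + 1$ features, and it pools over the whole feature map at once. The ReLU activation, the toy numbers, and the function names are stand-ins for illustration only.

```python
def conv1d_feature_map(x, w, b, f):
    """Slide an n x k filter `w` over L k-dimensional vectors `x` (stride 1)."""
    n, L = len(w), len(x)
    feature_map = []
    for i in range(L - n + 1):
        total = b
        for r in range(n):              # rows of the window
            for d in range(len(w[r])):  # embedding dimensions
                total += w[r][d] * x[i + r][d]
        feature_map.append(f(total))
    return feature_map


def relu(value):
    # Stand-in activation for this example only.
    return max(0.0, value)


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # L = 4, k = 2
w = [[1.0, 1.0], [1.0, 1.0]]                          # n = 2
c = conv1d_feature_map(x, w, 0.0, relu)

max_pooled = max(c)
avg_pooled = sum(c) / len(c)
print(c)           # [2.0, 3.0, 2.0]
print(max_pooled)  # 3.0
```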

The Siamese network was created for one-shot learning, which allows accurate training with very little data. It trains by sharing weights between two inputs in neural networks with the same architecture. The distance between the network’s outputs (such as embeddings) for the two inputs is then computed using measures such as the L1 norm and the L2 norm. If the two inputs belong to the same class, their outputs are trained to be near one another; if they belong to different classes, their outputs are trained to be far apart. The Siamese network may therefore be thought of as a network that extracts features from the input in the form of embedding vectors. Figure 5 is an example of a Siamese network architecture.
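A minimal sketch of the two ingredients described here, weight sharing and a distance between the resulting embeddings, might look as follows. The linear `shared_encoder`, its weights, and the input vectors are hypothetical toy values; a real Siamese network would use a trained deep encoder and a loss defined on the distance.

```python
import math


def shared_encoder(x, weights):
    """Toy linear encoder; the SAME `weights` are used for every input."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]


def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


weights = [[1.0, 0.0], [0.0, 2.0]]            # shared by both branches
emb_a = shared_encoder([3.0, 0.0], weights)   # [3.0, 0.0]
emb_b = shared_encoder([0.0, 2.0], weights)   # [0.0, 4.0]

print(l1_distance(emb_a, emb_b))  # 7.0
print(l2_distance(emb_a, emb_b))  # 5.0
```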

In this study, we applied the notion of the Siamese network to the CNN to simultaneously find linguistic rules in words and parts of speech by age group. In other words, the convolutional layer shares weights when learning the word and part-of-speech features. In the convolution layer, the size of the convolution filter is 3, the number of filters is 256, and the activation function is the Gaussian Error Linear Unit (GELU) function, shown in Equation (3):

$$f(x) = x \times \Phi(x), \tag{3}$$

where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution evaluated at $x$. For each of the feature maps for word and part-of-speech ($c_{word}$ and $c_{pos}$), we apply max-pooling and average-pooling together to compute the maximum and average values. The maximum and average values are concatenated for $c_{word}$ and for $c_{pos}$, giving $pool_{word}$ and $pool_{pos}$, which are then passed to the fully connected layer.
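The GELU activation and the concatenation of the two pooling results can be written out directly. In this sketch the standard normal cumulative distribution function is computed with `math.erf`; the helper names and the toy feature maps (two filters instead of 256) are assumptions made for illustration.

```python
import math


def standard_normal_cdf(x):
    """Phi(x) for the standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x):
    return x * standard_normal_cdf(x)


def pool_concat(feature_maps):
    """Concatenate per-filter maxima with per-filter averages."""
    maxima = [max(fm) for fm in feature_maps]
    averages = [sum(fm) / len(fm) for fm in feature_maps]
    return maxima + averages


# Two toy feature maps, one per filter.
toy_maps = [[0.0, 2.0, 1.0], [4.0, 0.0, 2.0]]
pooled = pool_concat(toy_maps)
print(pooled)     # [2.0, 4.0, 1.0, 2.0]
print(gelu(0.0))  # 0.0
```

With 256 filters, the same concatenation yields a vector of 512 values per input, which matches the 512 hidden nodes reported for the first fully connected layer.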

4.6. Fully Connected Layer

A fully connected (FC) layer is a layer connected to all nodes (neurons) of the preceding layer, and it is the most basic structure of a neural network. The FC layers serve three functions in this study. The number of hidden nodes in the first and second FC layers was set equal to the size of the vector received from the preceding layer. The first FC layer is trained separately for $pool_{word}$ and $pool_{pos}$; its number of hidden nodes is set to 512, and its activation function is the GELU function. The two outputs of the first FC layer are concatenated and passed to the second FC layer, in which the number of hidden nodes is set to 1024 and the activation function is again the GELU function. The final FC layer serves as the output layer: it classifies the language development level using these features, with the softmax function as its activation function. The softmax function is shown in Equation (4):

$$f(\vec{x})_{k} = \frac{e^{x_{k}}}{\sum_{j=1}^{K} e^{x_{j}}} \quad (k = 1, \dots, K), \tag{4}$$

where $K$ is the number of classes.
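As a worked example of the softmax output, the sketch below turns three hypothetical output-layer scores into probabilities over the three age groups and picks the most probable one. Subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the result unchanged; the scores themselves are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # improves numerical stability, result is unchanged
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


age_groups = [2, 4, 6]          # classes used in the study
toy_scores = [2.0, 1.0, 0.1]    # hypothetical output-layer scores
probs = softmax(toy_scores)
predicted_age = age_groups[probs.index(max(probs))]
print(predicted_age)  # 2
```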

4.7. Normalization and Regularization

For normalization, we applied LayerNormalization [35] to the feature maps $c_{word}$ and $c_{pos}$. LayerNormalization is applied independently to each feature $c_{i}$ in the feature map $c$, normalizing it by the mean ($\mu$) and standard deviation ($\sigma$) of $c_{i}$. Furthermore, it can be used in the same way on both training and test data and is time-efficient. LayerNormalization is shown in Equations (5)–(7):

$$\mu_{i} = \frac{1}{H} \sum_{j=1}^{H} x_{ij}, \tag{5}$$

$$\sigma_{i} = \sqrt{\frac{1}{H} \sum_{j=1}^{H} (x_{ij} - \mu_{i})^{2}}, \tag{6}$$

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_{i}}{\sqrt{\sigma_{i}^{2} + \epsilon}}, \tag{7}$$

where $H$ is the number of features in that layer, $x_{ij}$ is the $j$-th feature of the $i$-th input vector, and $\epsilon$ is a small constant added for numerical stability. For regularization, we applied Dropout [36] to the output of the first FC layer, with the drop rate set to 0.3.
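Equations (5)–(7) translate almost line by line into code. The sketch below normalizes a single vector; the value of `eps` and the example input are assumptions, and the learnable scale and shift that many library implementations add are omitted because they do not appear in the equations.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one vector by its own mean and variance (Eqs. 5-7)."""
    H = len(vector)
    mu = sum(vector) / H                           # Equation (5)
    var = sum((v - mu) ** 2 for v in vector) / H   # sigma squared, Equation (6)
    return [(v - mu) / math.sqrt(var + eps) for v in vector]  # Equation (7)


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
# norm_mean is approximately 0 and norm_var approximately 1.
```

Because the statistics are computed from each vector alone, the same function applies unchanged at training and test time.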

5. Experimental Results and Discussion

In this section, we describe the data used in the experiment and report the performance of the baselines and the proposed model. To ensure the consistency and reliability of the analysis, LSA examines only a set number of the utterances produced by children, such as 3–50 [37,38,39,40]. The data we collected were from children whom speech-language pathologists judged to be normal, and the sample size was very small. To increase the sample size and define the evaluation criteria, we divide the number of each child’s utterances by 20. The description of the generated data is shown in Table 5.

We compared the performance of various deep learning-based approaches and large-scale language models in the experiment, which are described as follows:

LSTM-M2O. A Long Short-Term Memory (LSTM) model with a many-to-one architecture;
LSTM-M2M. An LSTM model with a many-to-many architecture;
CNN-MP. A CNN model with a max-pooling operation;
CNN-AP. A CNN model with an average-pooling operation;
CNN-MP/AP. A CNN model with max-pooling and average-pooling operations applied simultaneously [41];
CNN + LSTM. A model that trains an LSTM-M2M on the output of a CNN-MP [42];
LSTM + CNN. A model that trains a CNN-MP on the output of an LSTM-M2M;
Transformers. A model with the basic Transformer architecture for classification tasks.

We also compared models trained with words only against models trained with both words and parts of speech. Finally, for the proposed method, we compared the use and non-use of the Siamese network to examine whether our hypothesis held. Each evaluation metric is reported as the average over 5-fold cross-validation. Table 6 shows the outcomes of the experiments.
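The averaging over folds can be sketched as follows. This is a generic k-fold split over sample indices, not the study's exact protocol: shuffling and per-age-group stratification are omitted, and `evaluate_fold` is a hypothetical callback standing in for training and scoring a model on one fold.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k (train, test) index pairs."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


def cross_validate(n_samples, evaluate_fold, k=5):
    """Average the score returned by `evaluate_fold` over all k folds."""
    scores = [evaluate_fold(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


def no_leak_score(train_idx, test_idx):
    # Placeholder scorer: 1.0 when train and test share no sample.
    return 1.0 if not set(train_idx) & set(test_idx) else 0.0


folds = k_fold_indices(12, k=5)
mean_score = cross_validate(12, no_leak_score, k=5)
print([len(test) for _, test in folds])  # [3, 3, 2, 2, 2]
```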

As shown in Table 6, regardless of whether a large-scale language model is used, LSTM is not suitable for classifying children’s language development levels. We therefore consider that it is difficult to discover linguistic rules in children’s utterances as time-series characteristics. In contrast, the CNN results confirm that features relevant for screening children’s language development levels can be found in the words and parts of speech appearing within a certain window size. With an accuracy of 78%, our proposed model outperformed the other models. We consider that sharing weights for word and part-of-speech allowed linguistic rules to be used effectively. However, the CNN-MP/AP model using only words also showed a fairly high accuracy of 77.2%. A confusion matrix was used to compare the performance of the two models, and the findings are shown in Table 7.
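For readers unfamiliar with the comparison tool, the sketch below builds a confusion matrix and derives accuracy from it. The label lists are invented toy values used only to show the mechanics; they are not the study's data or results.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


labels = [2, 4, 6]               # age groups
y_true = [2, 2, 4, 4, 6, 6]      # toy ground truth (not real data)
y_pred = [2, 2, 4, 6, 6, 4]      # toy predictions (not real data)

matrix = confusion_matrix(y_true, y_pred, labels)
toy_accuracy = accuracy(matrix)
print(matrix)  # [[2, 0, 0], [0, 1, 1], [0, 1, 1]]
```

The off-diagonal cells show which pairs of classes are confused with each other, which is information a single accuracy figure hides.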

As shown in Table 7, the two models correctly classified 2-year-old children, but their performance differed for 4-year-old and 6-year-old children. This issue emerged in the majority of models, and even the low-performance models could not correctly classify the 2-year-old. In fact, children’s language develops rapidly at the age of four, and the grammar and form of the mother tongue are completed by the age of five to six. For this reason, the deep learning-based model is considered to have trouble differentiating between 4-year-olds and 6-year-olds.

In the case of the large-scale language models, no significant performance gain could be verified even when the same downstream model was used. Because a large-scale language model is trained on written language, we consider that it has difficulty representing spoken language. Moreover, children’s language systems are incomplete, so a large-scale language model trained on a corpus produced by adults is not considered appropriate for them.

6. Conclusions

In this study, deep learning was introduced to objectively and readily identify a child’s language development level, and the feasibility of a large-scale language model was investigated. We employed linguistic features such as word and part-of-speech, together with a CNN based on the notion of Siamese networks to use them effectively. Our proposed model showed an accuracy of about 78%, and we verified that a large-scale language model was inadequate for solving this challenge.

The deep learning-based approach we propose has several advantages. First, because the challenge is solved using an end-to-end architecture, there is no need for a separate language analysis. In LSA, utterances had to be transcribed according to established rules in the process of using LSA software, and a separate language analysis tool was required. For this reason, the performance of language analysis tools has a significant impact on LSA, but we have avoided this problem. Second, the deep learning-based approach is capable of generalized inference (or evaluation). In contrast, LSA evaluates the outcomes of the analysis by comparing them to those of the same age group. Furthermore, in order to improve the reliability of the evaluation, LSA needs transcription data to be acquired on a regular basis, which is difficult to do. Finally, the deep learning-based approach can discriminate between children who have delayed language development and those who have rapid language development.

However, we also discovered some limitations: First, collecting high-quality data is challenging due to a lack of interest in children’s language development. Second, there is a lot of inappropriate data that cannot be used within the collected data. Third, the quality of the data collected may vary based on external factors such as the expert’s competency or the child’s personality.

In the future, we will look for linguistic patterns that can effectively discriminate between 4-year-olds and 6-year-olds. We will also create a model architecture that can effectively train these features. In addition, we will analyze the transcription data to see whether there are any outliers and, if there are, identify their causes. Furthermore, we will explore methods for utilizing large-scale language models in this challenge. We anticipate that this will allow us to construct a deep learning-based model with an end-to-end architecture that can evaluate children’s language development levels more accurately than previously possible.

Author Contributions

Conceptualization, Y.-S.K.; data curation, Y.-K.L. and B.-D.O.; formal analysis, B.-D.O., J.-D.K. and C.-Y.P.; funding acquisition, Y.-K.L. and Y.-S.K.; methodology, B.-D.O.; project administration, Y.-S.K.; resources, Y.-K.L., C.-Y.P. and J.-D.K.; supervision, Y.-S.K.; validation, B.-D.O.; writing—original draft, B.-D.O.; writing—review and editing, Y.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF) (NRF-2019S1A5A2A03052093), Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (NO. 2021-0-02068, Artificial Intelligence Innovation Hub (Seoul National University)) and ‘R&D Program for Forest Science Technology (Project No. 2021390A00-2123-0105)’ funded by Korea Forest Service (KFPI).

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and was approved by the Institutional Review Board of Hallym University (HIRB-2019-036).

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. Architecture of language development screening model.

Figure 2. Example of part-of-speech system of Korean and English.

Figure 3. Examples of features and structures used as inputs of BERT.

Figure 4. Architecture of Word2Vec. (a) CBOW model: the vector of the central word is learned by predicting it from the surrounding words. (b) Skip-gram model: the surrounding words are predicted from the central word. Negative sampling is used to reduce the computation, which otherwise grows with the vocabulary size.
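As a rough illustration of the two training schemes in Figure 4, the sketch below builds (context, center) examples for CBOW and (center, context) pairs for skip-gram from a tokenized utterance. The function names, the window size, and the example tokens are assumptions made for illustration and are not taken from the paper. Negative sampling is omitted.

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Return (context words, center word) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, center))
    return examples


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center word, context word) pairs: skip-gram predicts each context word from the center."""
    pairs = []
    for context, center in cbow_examples(tokens, window):
        pairs.extend((center, word) for word in context)
    return pairs


# Hypothetical tokenized utterance (not from the transcription data).
tokens = ["teddy", "bear", "rides", "horse"]
cbow = cbow_examples(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)
```

With a window of 1, the word "bear" yields one CBOW example with the context ["teddy", "rides"], and two skip-gram pairs, ("bear", "teddy") and ("bear", "rides").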

Figure 5. Example of Siamese network architecture.

Table 1. Description of transcription data.

Age	People	Utterances (Avg.)	Tokens
2-year-old	20	131	6K
4-year-old	48	155.56	27K
6-year-old	43	167.56	32K
Total	111	151.37	65K

Table 2. Example transcription of a 2-year-old child.

Topic	Read a Book
Speaker	Utterance (Korean)	Utterance (English)
Expert (Question)	곰돌이는 지금 뭐하고 있어?	What is the teddy bear doing now?
Child	말타고 있어.	Riding a horse.
Expert (Question)	그렇구나. 곰돌이가 어디 가는 거야?	Okay. So where is the teddy bear going?
Child	집에.	At home.
Expert (Reaction)	아 집에 가는구나.	Oh, the teddy bear is going home.
Child	옆에 코끼리 있어.	There is an elephant next to the teddy bear.
…

Table 3. Types of preprocessed vocabulary.

Type	Description	Symbol
Maze words	Repetitions, revisions	-
Silent pauses	More than 3 s	NOSPK
Inaccurate pronunciation	-	UNK
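The mapping in Table 3 could be applied along the following lines. This is a minimal sketch under stated assumptions: the annotation labels (`maze`, `pause`, `unclear`, `word`) and the `preprocess` function are hypothetical, since the transcript markup is not described here, and the "-" symbol for maze words is assumed to mean that those tokens are dropped.

```python
from typing import List, Tuple

# Symbols from Table 3.
NOSPK = "NOSPK"  # silent pause of more than 3 s
UNK = "UNK"      # inaccurate pronunciation

# Hypothetical annotation labels; the actual transcript markup is not given in the source.
MAZE, PAUSE, UNCLEAR, WORD = "maze", "pause", "unclear", "word"


def preprocess(tokens: List[Tuple[str, str]]) -> List[str]:
    """Map annotated (text, label) tokens to the vocabulary described in Table 3."""
    output = []
    for text, label in tokens:
        if label == MAZE:
            continue  # assumption: "-" in Table 3 means maze words are dropped
        if label == PAUSE:
            output.append(NOSPK)
        elif label == UNCLEAR:
            output.append(UNK)
        else:
            output.append(text)
    return output


# Hypothetical annotated utterance, for illustration only.
sample = [
    ("the", MAZE), ("the", WORD), ("bear", WORD),
    ("", PAUSE), ("gos", UNCLEAR), ("home", WORD),
]
cleaned = preprocess(sample)
```

Here the repeated "the" is removed as a maze word, the pause becomes NOSPK, and the unclear token becomes UNK.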

Table 4. Korean part-of-speech system.

Type of Korean Part-of-Speech
Noun Pronoun Numeral Verb Adjective Adverb
Determiner Postposition (Josa) Interjection Ending

Table 5. Description of data used in experiment.

Age	Data	Length (Avg.)
2-year-old	113	41.17
4-year-old	303	67.55
6-year-old	283	79.30
Total	699	62.67
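The "Total" row of Table 5 can be checked with a few lines of arithmetic. The sketch below suggests that the total average length (62.67) is the unweighted mean of the three per-age averages rather than a mean weighted by the number of samples. This reading is inferred from the table values and is not stated in the source.

```python
# Values copied from Table 5: age group -> (number of samples, average length).
table5 = {
    "2-year-old": (113, 41.17),
    "4-year-old": (303, 67.55),
    "6-year-old": (283, 79.30),
}

total_samples = sum(n for n, _ in table5.values())

# Unweighted mean of the per-group averages.
unweighted = sum(avg for _, avg in table5.values()) / len(table5)

# Mean weighted by group size, shown for comparison only.
weighted = sum(n * avg for n, avg in table5.values()) / total_samples

print(total_samples)         # 699
print(round(unweighted, 2))  # 62.67
print(round(weighted, 2))
```

The weighted mean comes out higher (roughly 68), because the 2-year-old group is both the smallest and the one with the shortest samples.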

Table 6. Experimental results.

Type	Methods	Accuracy (Avg., %)
Word	Large-scale LM
	KoBERT	65.8
	+LSTM-M2O	62.3
	+LSTM-M2M	61.6
	+CNN-MP/AP	69.7
	KR-BERT	69.4
	+LSTM-M2O	67.0
	+LSTM-M2M	63.8
	+CNN-MP	72.4
	+CNN-AP	73.4
	+CNN-MP/AP	73.9
	+Dense for each pooling operation	75.1
	Not used
	LSTM-M2O	66.5
	LSTM-M2M	66.5
	CNN-MP	66.0
	CNN-AP	71.2
	CNN-MP/AP	75.0
	+Dense for each MP/AP	77.2
	CNN + LSTM	70.4
	LSTM + CNN	71.7
	Transformer	74.2
Word + Part-of-Speech	Large-scale LM and PoS Embeddings
	KR-BERT
	+CNN-MP/AP	73.4
	Not used
	CNN-MP	69.3
	CNN-AP	72.1
	CNN-MP/AP	73.9
	+Dense for concat output	77.1
	+Dense for each output	76.3
	Transformer	75.0
	Proposed Model	78.0
	+Without the notion of Siamese network	73.96

Table 7. Confusion matrix. Each value is the average over the 5 folds.

Proposed Model	2-Year-Old	4-Year-Old	6-Year-Old
2-year-old	20.4	2.2	0
4-year-old	1.2	46.6	12.8
6-year-old	0	14.6	42
CNN-MP/AP	2-Year-Old	4-Year-Old	6-Year-Old
2-year-old	20.8	1.8	0
4-year-old	1.4	46.6	12.6
6-year-old	0	16	40.6
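To relate Table 7 to the accuracies in Table 6, overall accuracy can be recomputed from each fold-averaged matrix as the diagonal sum divided by the sum of all cells. The sketch below is illustrative only: the helper names are assumptions, the source does not state which axis holds the true label, and an accuracy computed from an averaged matrix need not match an average of per-fold accuracies exactly.

```python
from typing import List

# Fold-averaged confusion matrices copied from Table 7.
# Rows and columns are ordered 2-, 4-, 6-year-old. Which axis holds the true
# label is not stated in the source; overall accuracy does not depend on it.
proposed = [
    [20.4, 2.2, 0.0],
    [1.2, 46.6, 12.8],
    [0.0, 14.6, 42.0],
]
cnn_mp_ap = [
    [20.8, 1.8, 0.0],
    [1.4, 46.6, 12.6],
    [0.0, 16.0, 40.6],
]


def accuracy(matrix: List[List[float]]) -> float:
    """Overall accuracy: diagonal sum divided by the sum of all cells."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def row_rates(matrix: List[List[float]]) -> List[float]:
    """Diagonal entry divided by its row sum, for each row."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


print(round(100 * accuracy(proposed), 1))   # 78.0
print(round(100 * accuracy(cnn_mp_ap), 1))  # 77.3
```

Each matrix sums to 139.8, which is the 699 samples of Table 5 divided by 5 folds. The row rates show that almost all errors are confusions between the 4-year-old and 6-year-old groups, while the 2-year-old group is rarely confused with either.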

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Oh, B.-D.; Lee, Y.-K.; Kim, J.-D.; Park, C.-Y.; Kim, Y.-S. Deep Learning-Based End-to-End Language Development Screening for Children Using Linguistic Knowledge. Appl. Sci. 2022, 12, 4651. https://0-doi-org.brum.beds.ac.uk/10.3390/app12094651

