Standardized Versus Naturalized: An Evaluation of Child Morphological and Syntactic Assessments

Speech-language pathologists may choose to evaluate children’s language using standardized or naturalized assessments. This study investigated whether the Clinical Evaluation of Language Fundamentals-Preschool 2 (CELF-P 2), a standardized assessment, and language sampling, a naturalized assessment, reveal the same information about children’s linguistic competence and performance. Children ages 3.0-7.0 were assessed with specific focus on morphology and syntax. The participants completed four morphosyntactic-based subtests of the CELF-P 2. Additionally, play-based interactions, used to elicit natural language, were video-recorded. The CELF-P 2 was scored and language samples were transcribed and analyzed. Mean length of utterance (MLU) scores showed a slightly more variable trend around the mean than CELF-P 2 scores and there were no significant correlations between the two assessments. Furthermore, the two forms of assessment produced incongruous age equivalents for 67% of the participants (four out of six) and participants produced different morphosyntactic structures during each type of assessment. Thus, results indicated limitations and successes of the different assessment approaches. When used alone, neither form of assessment provided a completely accurate representation of children’s language acquisition. However, when used in conjunction, the two assessments may represent the linguistic competence and performance of children more accurately.


INTRODUCTION
Clinicians utilize a variety of evaluation tools to analyze and document a child's language development. According to Condouris, Meyer, and Tager-Flusberg (2003), assessments are used to measure language skills, identify children with language impairments, and monitor the efficacy of treatment. Assessments account for a great amount of a speech-language pathologist's (SLP) professional time and help justify decisions of treatment for clients. As noted by Huang, Hopkins, and Nippold (1997), SLPs spend "21% of their work time in evaluation, indicating that evaluation is a major professional activity" (p. 12).
Methods to measure language acquisition include both naturalized and standardized assessments. While naturalized assessments provide insight into a child's language use in everyday settings and allow clinicians to judge a child's overall communication development, standardized assessments are highly structured and objective, and they help clinicians identify impairments through comparison to a normative sample (Tyler & Tolbert, 2002). Spontaneous language samples can be collected using conversation, free play, story generation, and interviews (Evans & Craig, 1992; Southwood & Russell, 2004). In contrast to naturalized assessments, standardized assessments follow a systematic structure (Condouris et al., 2003). Brown (1973) provided a significant source of research for naturalized assessment by studying child language development through the transcription and analysis of spontaneous speech in a longitudinal study. Brown concluded that calculating mean length of utterance (MLU) is an appropriate tool to indicate syntactic and morphological development of preschool children. Brown developed an index of syntactical development in which each successive stage accounts for new linguistic acquisition and grammatical growth. After the fifth stage, the reliability of MLU decreases because the production of longer utterances does not necessarily represent a child's linguistic knowledge (Johnston, 2001). MLU is a frequently used and widespread measure of language development (Klee, Schaffer, May, Membrino, & Mougey, 1989). Furthermore, as highlighted by Miller and Chapman (1981), MLU is a good predictor of developmental achievement.
MLU, as defined by Rice et al. (2010), is the number of morphemes, or the units of meaning, in each of a young child's natural utterances. Brown's grammatical morphemes appear throughout a young child's life and help shape the meaning of the child's utterances.
As a child transitions through Brown's stages of morphological development, utterances and sentence structure become more complex. Remarkably, children acquire Brown's morphosyntactical stages without formal instruction (Berko Gleason & Bernstein Ratner, 2013; Pinker, 1994). Rice et al. (2010) compared MLU to age-referenced normative data for typically developing children and children with specific language impairment (SLI) (ages 2.6-9.0). A correlation between age progression and MLU was revealed, suggesting that MLU yields reliable and valid results of children's language acquisition in comparison to age (Rice et al., 2010).
As opposed to the naturalized assessment of language sampling and MLU calculation, standardized assessments provide rigid structure, including subtests, which often target specific areas of language, and allow for comparison with a normative sample. Depending on the area of language targeted, the subtests may utilize tools such as picture identification, sentence completion, imitation, and question response to elicit language (Secord, Semel, & Wiig, 2013).
Although norm-referenced standardized assessments play an important role in language evaluation because they provide an objective form of measurement, the validity of the assessments may be questioned (McFadden, 1996). Limitations of standardized assessments include the potentially restricted comparison group, time constraints, and financial burdens. A comparison group represents an ideal population which may not be linguistically comparable in gender, culture, or socioeconomic status (Huang et al., 1997; McFadden, 1996).
While opinions of standardized testing vary, many researchers have concluded that naturalized assessments and standardized assessments alone are not an adequate representation of child language development (Huang et al., 1997; Southwood & Russell, 2004; Tyler & Tolbert, 2002).
Rather, standardized assessments, used in conjunction with naturalized assessments, increase reliability of language evaluations and test all the components of language. For example, Condouris et al. (2003) provided a comparison of standardized and naturalized assessments for children with autism spectrum disorder (ASD). Results of the two forms of assessment revealed comparable data indicating that both standardized and naturalized assessments test the same underlying linguistic functions. Moreover, Tyler and Tolbert (2002) provided a comparison of standardized and naturalized assessments for a single subject within a 90-minute time span and found that combining the results of the assessments was ideal.
The present study examined whether the results of the Clinical Evaluation of Language Fundamentals-Preschool 2 (CELF-P 2), a standardized assessment, and language sampling, a naturalized assessment, revealed the same fundamental morphosyntactic information about linguistic competence and performance. Although both methods assess similar components of child language development, it was hypothesized that language sampling may reveal a more accurate representation of a child's linguistic competence and performance.

METHOD

Participants
Six children aged 3.0 to 7.0 participated in this study. All participants were typically developing, which was verified by caregivers' completion of a background questionnaire (see Appendix A). Half of the participants were female (ages 4.2, 4.2, and 6.11) and half were male (ages 3.4, 4.8, and 6.0). All participants were monolingual speakers of American English. All language samples, with the exception of participant 1's, were collected in familiar settings (e.g., home). Participant 1's assessments were conducted at Iona College's Speech Communication Studies Department.

Procedure
Prior to the study, Iona College Institutional Review Board approval was obtained. The legal guardians of the participants were asked to complete a background questionnaire regarding their child's developmental and language histories (see Appendix A). Informed consent was acquired from the legal guardians prior to testing. Following consent, language samples were collected and four subtests of the CELF-P 2 were administered. Total testing time was approximately two hours. Follow-up and explanation of results with the legal guardians were completed after testing.

Language Sampling
All language samples were approximately 30 minutes, involving a play-based interaction constructed around the interests of the participants. The interactions were recorded using a video camera. The participants were allowed to move freely throughout the setting. The researcher and legal guardians engaged the participants in conversation to elicit volitional speech following the principles of milieu teaching, where the adults followed the child's attentional lead (Paul & Norbury, 2012). The adults utilized the participants' choice of toys, which included, for example, board games, dolls, and toy cars, to elicit and maintain communication. Furthermore, the researcher used open-ended prompts about the child's chosen item in order to further elicit language.
The language samples were transcribed by the researcher and checked by an SLP to verify the transcription. Utterance contour, pauses greater than two seconds, and inhalation served as utterance boundaries (Miller & Chapman, 1981). Approximately 120 child utterances were transcribed for each of the participants. For 67% of the samples, the medial portion was transcribed, which is the recommended best practice for language sample analysis (Paul & Norbury, 2012). The first 120 utterances were used for two participants (4 and 6) due to the robust quality and quantity of the utterances in the beginning of the sample. MLU was computed by dividing the total number of morphemes in the sample by the number of utterances produced during the language sample (Brown, 1973). Morphology and syntax were analyzed according to Brown's index of syntactical development (see Table 5.3 of Berko Gleason & Bernstein Ratner, 2013, for complete data). For each language sample, unintelligible phrases and hedges, such as "umms," were not included in the calculation of MLU. Incomplete phrases, defined as those containing unintelligible words, were not used in the calculation of MLU. All contractions except let's, don't, and won't were counted as two morphemes. Fillers, such as like, were also omitted from calculation. Reformulations, false starts, and repetitions were also not included. Phrases such as uh oh, uh huh, nuh uh, yup, and mhmm were counted in the calculation because they hold semantic meaning. Two SLPs also calculated each transcript's MLU. Reliability between the experimenter who calculated the original MLU and the two SLPs who later analyzed each sample was 83%. If a discrepancy occurred, an additional SLP was asked to respond and the most frequent response was selected.
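The MLU calculation described above can be illustrated with a minimal sketch. It assumes that each utterance has already been segmented into morphemes by hand and that the exclusions listed above (fillers, false starts, unintelligible phrases) have already been applied. The function name and the example utterances are hypothetical and are not data from the study.

```python
def mean_length_of_utterance(utterances):
    """Return total morphemes divided by the number of utterances.

    `utterances` is a list of lists; each inner list holds the morphemes
    of one utterance after manual segmentation and exclusions.
    """
    if not utterances:
        raise ValueError("at least one utterance is required")
    total_morphemes = sum(len(utterance) for utterance in utterances)
    return total_morphemes / len(utterances)


# Hypothetical, hand-segmented utterances (illustration only).
sample = [
    ["I", "want", "the", "car", "-s"],  # plural -s counted as its own morpheme
    ["he", "'s", "go", "-ing"],         # contraction counted as two morphemes
    ["yup"],                            # counted because it carries meaning
]

mlu = mean_length_of_utterance(sample)
print(f"MLU = {mlu:.2f}")  # 10 morphemes over 3 utterances
```

The sketch leaves the linguistic judgments (segmentation and exclusions) to the transcriber, which mirrors the manual procedure described in this section; only the final division is automated.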

CELF-Preschool 2
The CELF-P 2 was utilized as the standardized assessment for this study. The participants were required to complete four subtests of the CELF-P 2: Sentence Structure, Word Structure, Recalling Sentences, and Recalling Sentences in Context. These subtests focus on the participant's morphosyntactic development. The Sentence Structure subtest requires the examiner to read a sentence which corresponds to a set of photos.
Participants choose which photograph the sentence describes and thus demonstrate their receptive language abilities. The Word Structure subtest examines a participant's expressive mastery of word structure in relation to tense, comparative suffixes, derivational suffixes, possessives, and other grammatical forms. The examiner reads a sentence which corresponds to a picture. The examiner reads another sentence and the participants complete the sentence based on the preceding sentence's structure. For the Recalling Sentences subtest, the examiner reads a sentence and the participants repeat the sentence verbatim. Success in repetition depends upon the participant's number of errors. The Recalling Sentences in Context subtest requires the participant to repeat sentences of a story. The sentences in the subtests vary in structural complexity and length. Both the Recalling Sentences and Recalling Sentences in Context subtests assess the participant's expressive and receptive abilities.
During the evaluation, the researcher read directions to the participants prior to the beginning of each subtest. All protocols were followed as outlined in the CELF-P 2 instruction manual.
Following test administration, all individual subtest raw scores were converted to scaled scores (M = 10, SD = 3).

DATA ANALYSIS

Language Sampling
Each participant's language sample was analyzed independently. Utterances ranged in length from one morpheme (participants 1-6) to 46 morphemes (participant 6). Lengthier utterances, contributing to a higher calculation of MLU, were attributed to rambling, explanation of board games, and story-telling. Shorter utterances were attributed to yes/no questions and responses such as "okay." Specific linguistic background is listed in Table 1 for each participant. Only one MLU calculation derived from the language sample (participant 2) was judged to be an inaccurate representation of the child's linguistic competence and performance. While participant 2 produced long utterances, which created a larger MLU, the participant also expressed consistent word and sentence structure errors (e.g., substituted "them" for "they are") throughout the sample.

CELF-Preschool 2
Results varied among subtests and participants for the CELF-P 2. Raw scores, scaled scores, and age equivalents (AE) for participants are listed in Table 2.
Sentence Structure. The Sentence Structure subtest assessed receptive language skills by analyzing the child's ability to understand spoken sentences. Scaled scores ranged from eight to 14. There were no congruous errors among the participants. However, four participants (2, 3, 4, and 6) inaccurately responded to items that addressed the subordinate clause (e.g., before she ate the sandwich) and the passive voice (e.g., is being followed).
Word Structure. The Word Structure subtest evaluated expressive language skills by assessing the child's ability to produce morphological markers and pronouns. Scaled scores ranged from nine to 13. All participants incorrectly identified the contractible/auxiliary copula of "They are." Four participants (2, 3, 4, and 6) inaccurately identified the irregular past tense of "fell." However, of those same participants, only one participant (6) inaccurately identified a subsequent irregular past tense question (blew).
Recalling Sentences. The Recalling Sentences subtest assessed expressive language skills by analyzing a child's ability to repeat sentences without altering word or sentence structure and meaning. Scaled scores ranged from nine to 14. Participant 6 received a score of zero for incompletion of the subtest. All participants incorrectly recalled sentences which included an active declarative with a relative clause (e.g., the dad brought a book for his son who likes funny stories). Four of the five participants who completed the subtest inaccurately repeated sentences which included the active declarative with negation (e.g., the kindergartner cannot cross the street by himself), active declarative with noun modification (e.g., the big, brown dog ate all of the cat's food), and active declarative with a subordinate clause (e.g., because tomorrow is Saturday, we can stay up late tonight).
Recalling Sentences in Context. The Recalling Sentences in Context subtest is a supplementary subtest similar to the Recalling Sentences subtest but includes contextual cues through a story. In the subtest, five participants inaccurately repeated the sentence containing an active declarative with a relative clause (e.g., I am very happy that we finally found you, Grandma). Additionally, four participants (2, 3, 4, and 5) were unable to repeat sentences with an active declarative with an infinitive clause and negation (e.g., I can't wait to have Grandma come to our house). Furthermore, four participants (2, 3, 5, and 6) were unable to recall the active declarative with coordination (e.g., I fell and dropped my juice).

Language Sampling and Standardized Testing Comparison
Scores at an individual participant level from both the CELF-P 2 and MLU were compared to analyze the variance. This was done by calculating z-scores for each participant's individual score. Each participant's MLU was compared to the sample's mean because a population mean was not available. Subtest scores were compared to the population mean of the CELF-P 2. As shown in Table 3, MLU scores showed a slightly more variable trend around the mean than CELF-P 2 scores. However, most individual scores were within ± 1 standard deviation of the mean. Scores more than one standard deviation from the mean deviated only slightly further, with the highest MLU score being 1.5 standard deviations above the mean and the highest CELF-P 2 score being 1.33 standard deviations above the mean.
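The two z-score comparisons described above can be sketched as follows. The sketch assumes the CELF-P 2 scaled-score norms reported in the Method (M = 10, SD = 3) and, for MLU, assumes that the sample standard deviation (n - 1 denominator) was used; the text does not state which denominator was applied. The function names and the example MLU values are hypothetical.

```python
import statistics


def z_from_norms(scaled_score, norm_mean=10.0, norm_sd=3.0):
    """z-score of a CELF-P 2 scaled score against the normative mean and SD."""
    return (scaled_score - norm_mean) / norm_sd


def z_within_sample(values):
    """z-scores of each value against the sample's own mean and SD.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    sample_mean = statistics.mean(values)
    sample_sd = statistics.stdev(values)
    return [(value - sample_mean) / sample_sd for value in values]


# A scaled score of 14 lies 1.33 SDs above the normative mean of 10.
print(round(z_from_norms(14), 2))

# Hypothetical MLU values (not data from the study).
print(z_within_sample([3.0, 4.0, 5.0]))
```

Because the MLU z-scores are referenced to the six-participant sample rather than to a population, they describe relative standing within the sample only and are not directly comparable to norm-referenced z-scores.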
In addition, Pearson's product-moment correlation was used to test the relationship between CELF-P 2 and MLU measures within the current sample. None of the correlations between MLU and the CELF-P 2 subtests reached significance (Sentence Structure: r = .36, p = .48; Word Structure: r = .47, p = .35; Recalling Sentences: r = .36, p = .55).
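The following sketch shows one way to compute Pearson's r and its two-tailed p value without external libraries. It assumes the usual t test for a correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, and integrates the t density numerically with Simpson's rule. The function names are hypothetical, and the study's software is not reported. Under these assumptions, r = .36 gives p of about .48 for six pairs and about .55 for five pairs, which appears consistent with participant 6 being excluded from the Recalling Sentences correlation; that reading is an inference, not something the text states.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(r, n, steps=2000):
    """Two-tailed p value for a correlation r based on n pairs (requires |r| < 1)."""
    df = n - 2
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule for the area under the t density between 0 and t.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 2 * (0.5 - area)


print(round(two_tailed_p(0.36, 6), 2))  # six pairs
print(round(two_tailed_p(0.36, 5), 2))  # five pairs
```

With so few pairs, even moderate correlations fall well short of significance, so the nonsignificant results are best read as inconclusive rather than as evidence that the two measures are unrelated.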
Age Equivalents (AEs). AEs were determined based on the model provided by Miller and Chapman (1981), which denotes the age at which most children have an MLU equal to that of the children included in the present study.
Age-equivalence scores were included so that MLU and CELF-P 2 results could be compared. Five out of six participants (83%) had MLU AEs representative of their chronological ages (CA) (see Table 1). Age-equivalence scores were reported as a reliable and age-validated measure of syntactic growth in children with and without SLI (Rice et al., 2010). Results of the study conducted by Rice et al. (2010) revealed that MLU calculation is sensitive to language impairment throughout the range for which MLU is considered a reliable index.
MLU calculation and results of the Word Structure subtest produced different AEs for 67% of the participants (four of six participants). Both measures produced comparable AEs for the older participants (participants 1 and 5). For the four younger participants, MLU calculation and results of the CELF-P 2 produced different AEs for 75% of the participants (three of four participants). Although participant 2 had a comparable MLU and Word Structure subtest score, results of the measures were not comparable to the participant's CA (3.4).
Overall, approximately 83% of the participants (five of six participants) had an MLU AE comparable to their CAs. For these five participants, MLU was indicative of their CAs and language acquisition. However, only approximately 33% of the participants (two of six participants) had a standardized assessment AE comparable to their CAs. The two participants were aged 6.0 and 6.11.
Furthermore, participant 2 (age 3.4) had an MLU AE of >5.0 and a Word Structure AE of 4.8. While these results are much higher than expected for a child aged 3.4, the results demonstrate consistency between both measures. Overall, while the MLU AE scores appeared to reflect the participant's CA, the AE results of the CELF-P 2 indicated an older AE score for   5). Congruent results between the CELF-P 2 and language samples were found for 17.6% of items (e.g., prepositions, irregular past tenses, objective pronouns) when both the one-time accurate usage and 100% accuracy groups were analyzed.
Congruent results between the CELF-P 2 and language samples were found for 29.4% of items (e.g., objective pronouns, uncontractible copula/auxiliary, irregular past tense, third person singular, prepositions) when only the 100% accuracy group was analyzed for both forms of assessment. An increase in congruent results was found, to 47.1% of items (i.e., subjective pronouns, objective pronouns, contractible copula, irregular past tense, regular past tense, progressive -ing, regular plural, prepositions), when only the one-time accurate usage group was analyzed.
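The three congruence percentages can be reproduced with simple arithmetic. The sketch below assumes a total of 17 morphosyntactic items; this denominator is not stated in the text and is inferred only because three, five, and eight congruent items out of 17 match the reported 17.6%, 29.4%, and 47.1%. The function and constant names are hypothetical.

```python
def percent_of_items(congruent_items, total_items):
    """Percentage of items with congruent results, rounded to one decimal."""
    return round(100 * congruent_items / total_items, 1)


TOTAL_ITEMS = 17  # assumption inferred from the reported percentages

print(percent_of_items(3, TOTAL_ITEMS))  # both accuracy groups analyzed
print(percent_of_items(5, TOTAL_ITEMS))  # 100% accuracy group only
print(percent_of_items(8, TOTAL_ITEMS))  # one-time accurate usage group only
```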
The CELF-P 2 subtests contained only certain items from each morphosyntactic category. For example, although the CELF-P 2 tested only a limited number of pronouns (e.g., her, him, hers, he, she, herself), the participants utilized a variety of additional pronouns in the naturalized context (e.g., I, you, yours, we, they, my, itself). Furthermore, the CELF-P 2 tested morphosyntactic categories that were not frequently used during the language samples (e.g., noun derivation).
Overall, the participants showed a trend of producing morphological and syntactic structures during language sampling that they inaccurately produced during administration of the CELF-P 2. All of the participants were unable to produce the contractible copula, "they are," in the Word Structure subtest of the CELF-P 2. In the language sample, each participant produced some form of the contractible copula (e.g., they are, he is). Additionally, individual participants produced morphosyntactic structures during the naturalized assessment that they inaccurately produced during the CELF-P 2. For example, participant 3 was unable to produce the superlative "fastest" during the standardized evaluation but produced "biggest" during volitional speech. Some forms were produced accurately by participants during the CELF-P 2 but contained errors during language sampling. For example, participant 2 produced the

DISCUSSION
The primary purpose of this study was to investigate whether standardized and naturalized assessments produced comparable results of linguistic competence and performance of children. A larger sample size may show that the standardized and naturalized measures are measuring similar language skills while providing slightly different information about the child being tested (Bornstein & Haynes, 1998; Ukrainetz & Blomquist, 2002).

Challenges with the CELF-Preschool 2
Inconsistency of results was noted across the subtests of the CELF-P 2. Discrepancies between subtests may lead to inconclusive results of a child's language abilities. Even though the Recalling Sentences subtest and the Sentence Structure subtest both tested acquisition of syntax, the two subtests did not reveal congruent results (e.g., participants struggled with the passive voice during Sentence Structure but accurately produced it during Recalling Sentences). Furthermore, while participant 5 scored in the 75th and 84th percentiles for the Sentence Structure and Word Structure subtests, the participant scored in the 37th percentile for the Recalling Sentences subtest.
Additionally, discrepancies within an individual subtest may lead to inaccurate conclusions regarding language acquisition. For the Word Structure subtest, all participants incorrectly identified the contractible/auxiliary copula of "They are." However, within the same context, five of the six participants correctly identified the contractible/auxiliary copula of "She is." Furthermore, the complexity and appeal of completing subtests of standardized assessments may influence a child's performance. For example, the Recalling Sentences and Recalling Sentences in Context subtests challenged participants to reproduce sentences with multiple clauses (e.g., the dad bought a book for his son who likes funny stories). The length of the utterances and attentiveness of the participants may have increased the number of errors. Participant 6 refused to complete the subtest, claiming it was "boring." Thus, the inconsistencies noted within and across subtests may lead clinicians to question the test's results.

Age Equivalents
Inconsistency in results was noted not only across the subtests of the CELF-P 2, but also between the naturalized and standardized assessments. As compared to the calculation of MLU, which produced accurate AEs for five out of six participants, the AEs calculated from the CELF-P 2 were only accurate for two of the six participants. The two participants for whom the CELF-P 2 AEs were congruent with CA were the two oldest participants. Thus, language sampling may be more indicative of a child's true morphosyntactic abilities than standardized results for younger children; however, both language sampling and the CELF-P 2 may provide valid morphosyntactic results for older children.

Limitations and Future Directions
Limitations of the study included the sample size and reporting of AE scores. A larger sample size would allow for greater generalization of results and possibly yield additional findings. In addition, there are certain limitations associated with the use and reporting of AE scores (Maloney & Larrivee, 2007). Although standard scores are normally distributed, AE scores are not. Therefore, participants who score within a "normal range" may have an AE score that reflects a much lower or higher equivalent than their performance. Thus, an AE score reflects the number of items answered accurately as opposed to the quality of the responses.
To improve reliability of MLU calculation, future studies should include a larger number of participants, particularly younger participants.
MLU is considered an inaccurate measure of syntactic growth after Brown's fifth stage of morphological development. Therefore, a larger sample of younger participants would increase the ability to assess the validity of the study.

Conclusion
Assessment is a critical component of an SLP's workload. Previous research (Condouris et al., 2003; Huang et al., 1997; Tyler & Tolbert, 2002) suggests that a child's language should be analyzed using both standardized and naturalized methods. The present study adds to the literature by extending the results to the evaluation of morphosyntactic components of language specifically. While the present study provides some examples of challenges associated with standardized testing, standardized testing combined with naturalized testing appeared to yield the most accurate picture of a child's morphosyntactic abilities. Either method of assessment, when used individually, may not accurately represent a child's language abilities. Thus, SLPs may seek to employ both a psychometric and a descriptive approach when assessing morphology and syntax to obtain the most accurate picture of linguistic competence and performance.

Table 1.
Participants' Language History, MLU, and Age Equivalents

Table 2.
CELF-P 2 Raw Scores, Scaled Scores, and Age Equivalents

Table 5.
Percent of Participants Who Used Morphosyntactic Items Accurately on the CELF-P 2 Subtest and Language Sample

The findings are congruent with Rice et al. (2010), who established age-referenced MLU as a reliable and valid measurement of language acquisition. Akin to the present study, Rice et al. (2010) documented that MLU increased with age progression.