The Lump-Versus-Split Dilemma in Couple Observational Coding: A Multisite Analysis of Rapid Marital Interaction Coding System Data

Historically, observational couple communication researchers have oscillated between splitting behaviors into narrowly defined discrete codes and grouping behaviors into broader codes, sometimes within the same study. We label this the “lump-versus-split dilemma.” Coded data spanning a decade and 11 investigators were used to recommend the most meaningful number of codes to use when observing couples’ conflict. We combined data from 14 studies that used the Rapid Marital Interaction Coding System (RMICS) to score communication behavior during different-sex couples’ conflict interactions. In each study, couples completed at least one 10-minute, video-recorded conflict discussion. Communication during these interactions was coded by trained research staff using RMICS; all codes were compiled into a single data set for descriptive analysis and exploratory factor analyses (EFA). The final sample comprised N = 2011 couples. Several RMICS codes were extremely infrequent: distress-maintaining attributions, psychological abuse, withdrawal, dysphoric affect, and relationship-enhancing attributions. By far, the most frequent code was constructive problem discussion. EFAs yielded two factors for both women and men. Factor 1 (Negative) contained two items: distress-maintaining attributions and hostility. Factor 2 (Non-Negative) contained constructive problem discussion and humor (and, for women only, acceptance). Results side heavily with the “lump” camp in the lump-versus-split dilemma in couple observational coding. These RMICS factor analysis results converge with those from other systems and imply that the microanalytic “splitting” era in couples coding should draw to a close, with future studies instead focused on negative, neutral, and positive codes.

Keywords: couples, observational coding, conflict, partner communication, factor analysis

The Lump-Versus-Split Dilemma in Couple Observational Coding: Insights from Multiple Studies Using the Rapid Marital Interaction Coding System

A coding system is a work of art in service of science, attempting to quantify the ephemeral. It is scientific in that coding systems are tools to investigate particular constructs (Heyman, 2004). As Bakeman and Gottman (1997, p. 15) described it, “The coding scheme itself represents a hypothesis, even if it is rarely treated as such.” Yet tools are fixed things, appropriate for some purposes and not others. Hypotheses are flexible ideas, custom-built to answer a specific question in a particular setting, sample size, and context.

Like hardware-store tools, research tools are hard to create; represent massive investments in creating, testing, and producing; and are most useful to the world when the sunk costs of development lead to standardization and wisdom agglomerated from widespread use. In science, this represents a tool’s psychometric soundness. The art of tool creation is to hypothesize how the tool will be used and design it well from the outset, without the luxury of constant adaptation and nuanced change once the tool’s die is cast.

Observational coding of couples and families is a prime example of how the construction of tools can impact science. Given that existing tools often determine what questions researchers can investigate, tools’ implicit hypotheses should be periodically tested and the tools themselves adjusted, as necessary. One of the primary hypotheses in the field of observational couple coding, since its inception in the 1960s, is that the best way to describe (i.e., code) behavior is to split behavioral classes into a multitude of narrowly defined discrete acts. A competing hypothesis is that, even if there is nuance within behavioral classes, research expediency necessitates having no more codes than can actually be analyzed reliably (e.g., infrequent behaviors typically cannot be reliably coded). Furthermore, the realities of data analysis force the researcher to lump the multitudinous codes together despite untold hours of laborious coding. This causes problems, as the lumped combinations often vary from study to study for a given system, making comparisons of findings difficult (Heyman, 2001). This tension we label the “lump-versus-split dilemma in couple observational coding.”

In this paper, we combine the data from 14 studies from 11 investigators using the Rapid Marital Interaction Coding System (RMICS; Heyman, 2004) and use both descriptive statistics and exploratory factor analysis (EFA; Gorsuch, 1983) to infer the “sweet spot” between too many and too few codes. Next, we will summarize the history of couple coding and argue for the need for empirical guidance in determining the optimal number of codes for behavioral observation of couples’ conflict discussions.

History of Observational Coding of Family and Couple Conflict

Observation of familial conflict emerged from the behavioral tradition of the 1960s, which posited that each partner’s pleasing and displeasing behaviors are shaped by the other’s contingent responses (i.e., positive and negative reinforcement; e.g., Weiss et al., 1973).

The ur-coding system that influenced all that followed was the Family Interaction Coding System (FICS; see Reid, 1978). A group led by Gerald Patterson let the behaviors dictate which should be included: “We were behaviorists, and our strategy was to obtain data first and then develop a theory if one were justified” (Patterson et al., 1992, p. 1). Patterson’s team observed families in their homes while wearing gasmask-like headsets to narrate the behaviors (Patterson, 1982). The 29 most common and/or theoretically important behaviors were included in the FICS. The Marital Interaction Coding System adapted FICS codes for use with couples (Hops et al., 1972); by 1992, it contained 36 codes (Heyman, Weiss, et al., 1995). A somewhat similar, highly influential coding system by Gottman and colleagues, the Couples Interaction Scoring System (CISS; Gottman et al., 1977), had 27 content codes (with three simultaneously scored affect codes) and 25 codes in its “rapid” version (RCISS; Krokoff et al., 1989).

Thus, the field began firmly in the “splitter” camp. However, because ultra-microanalytic codes occur too infrequently in the standard 10-minute observed couple interaction, almost all of the over 200 observational investigations reviewed by Heyman (2001) used code combination, and almost no two studies using the same system grouped codes in the same way. In essence, the motto of research for the first couple of decades was “code now and figure out what the analyzable construct is later,” with the data forcing researchers to lump despite their initial coding inclinations.

Beginning in the 1990s, the second generation of coding systems was developed to code at the construct level. For example, Gottman’s Specific Affect Coding System (SPAFF; Gottman & Krokoff, 1989) used a smaller constellation (16) of discrete affects. Heyman’s (2004) Rapid Marital Interaction Coding System (RMICS) was guided by the results of an EFA of the MICS-IV (Heyman, Eddy, et al., 1995) to create a consolidated system, although it also added codes for attributions and self-disclosure, settling on 11 codes. Yet, even these smaller systems typically were analyzed by idiosyncratically lumping codes together (Heyman, 2001).

Empirical Guidance for Code Consolidation

Idiosyncratic lumping is not surprising, given that the small sample sizes of most couple observational coding studies preclude empirical approaches to consolidation. Rather, the codes may be combined in ways to fit the hypotheses or the data. For example, low-frequency codes in a given situation, such as highly negative codes in a social support task, are often combined with less negative codes or dropped altogether. Two lines of research can give some indication of the “sweet spot” of the advisable number of codes. The first is a study by Heyman et al. (2001) investigating how much time is necessary to reliably observe second-generation system constructs. (“Reliability” here refers to stability or reproducibility, not to inter-observer agreement.) They found that 10–15 minutes was sufficient to estimate observed frequencies of RMICS codes in (a) couples presenting for therapy, (b) non-distressed community couples, and (c) engaged couples. However, the authors noted that there was also substantial variability of couples’ conversations across topics and time, and some specific communication behaviors (e.g., men’s dysphoric affect) required much longer observations to identify stable estimations. Thus, the reliability of observational coding increases with fewer, broader, coding categories.

The second line of research involves the factor analysis of coding systems. Factor analysis is a statistical procedure that examines relationships between variables to identify potential underlying constructs. When applied to coding systems, this generally involves assessing the correlation matrix between the frequency of individual codes in a dataset to determine which communication behaviors co-occur, potentially indicating a smaller number of higher-order communication constructs. Although this method is rooted in statistics, some interpretation is involved; for example, researchers must assess the loadings to name the factor.

Factor analysis has been conducted on a variety of observational coding systems, including the MICS, CISS, and other microanalytic systems (e.g., Heyman, Eddy, et al., 1995; Jacob & Krahn, 1987; Remen et al., 2000; Williamson et al., 2011). Although results differ slightly in each study depending on the specific coding system and type of interactions, factors identified include general positive behaviors (e.g., “Emotional Engagement,” “Humorous Distraction,” “Humor”), general negative behaviors (e.g., “Hostility,” “Negative Evaluation”), and general neutral behaviors (e.g., “Problem-Solving Focus,” “Problem Discussion,” “Responsibility Discussion”). Again, many low-frequency codes are often excluded. In many of these studies, data are derived from a single sample, and sample sizes are thus relatively small, especially for EFA, which benefits from larger samples. As such, although there are certainly similarities between studies using factor analysis, findings from each study may be overfitted.

Aims of this Study

Our goal is to use both descriptive statistics and EFA on a large, multi-study/multi-investigator data set to derive the optimal number of observational communication constructs across studies. Combining data from multiple studies eliminates competing hypotheses that the findings are due to a particular sampling strategy (i.e., random digit dialing, convenience sampling, and purposive sampling are represented), particular strategy for setting up the discussions, or other sample-specific method confounds.

Method

Data from 14 studies (see Online Table 1) were combined. (All studies received Institutional Review Board approval from their home institutions. Because data were not collected for the current study, and we could not link the data back to individual participants, the current study is not considered human subjects research.) Although each study protocol varied slightly, each study involved a couple conflict discussion task (see Heyman & Slep, 2004). Both partners completed questionnaires, including a report of general demographic data, and completed a measure from which areas of conflict or desired change were isolated (e.g., disagreements about spending habits or household chores). Couples were then asked to talk about the area of conflict for approximately 10 minutes in an unmediated, video-recorded discussion. These 10-minute recorded analog discussions are a standard research procedure (Heyman & Slep, 2004); there is strong evidence that participants quickly habituate to recording and that communication during these discussions is fairly representative of at-home communication patterns (Foster et al., 1997; Wieder & Weiss, 1980). If more than one discussion was collected, we randomly chose one to include in these analyses.

Interactions from all the studies were coded using the RMICS by trained research staff at New York University. As shown in Table 2, RMICS includes five negative codes, four positive codes, one neutral code, and one “other” code. The basic coding unit is the speaker turn; if a speaker turn lasts longer than 30 s, it is coded in 30-s intervals. Both speaker and listener behavior are coded in each unit. Coders assign one code to each partner during each unit; if two or more codes are present, a theoretically derived hierarchy (i.e., negative codes first, then positive codes, then neutral codes) indicates which code to retain. Definitions and examples of each code are shown in Online Table 2. Reliability was within the acceptable range across investigations (G > .70).
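The hierarchy rule described above can be illustrated with a minimal Python sketch. The valence assignments follow Table 2; everything else is an assumption made for illustration: the function name `retain_code`, placing the “other” code last, and breaking ties within a valence by the order in which candidates are listed (the text above specifies only the ordering across valences).

```python
# A priori valence of each RMICS code, as listed in Table 2.
VALENCE = {
    "distress_maintaining_attributions": "negative",
    "dysphoric_affect": "negative",
    "hostility": "negative",
    "psychological_abuse": "negative",
    "withdrawal": "negative",
    "acceptance": "positive",
    "humor": "positive",
    "relationship_enhancing_attributions": "positive",
    "self_disclosure": "positive",
    "constructive_problem_discussion": "neutral",
    "other": "other",
}

# Lower number wins. Ranking "other" last is an assumption of this sketch.
PRIORITY = {"negative": 0, "positive": 1, "neutral": 2, "other": 3}


def retain_code(candidates):
    """Return the single code kept for one partner in one coding unit.

    Ties within a valence are resolved by list order, which is an
    illustrative assumption rather than a rule stated in the paper.
    """
    if not candidates:
        raise ValueError("at least one candidate code is required")
    return min(candidates, key=lambda code: PRIORITY[VALENCE[code]])


# A turn containing problem discussion, humor, and hostility is coded as hostility.
kept = retain_code(["constructive_problem_discussion", "humor", "hostility"])
```

One consequence of such a rule is that neutral and positive frequencies are counts of units in which no higher-priority behavior occurred, not counts of every instance of the behavior.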

Table 2

Exploratory Factor Analysis (EFA) Results and Descriptive Statistics for RMICS Codes

RMICS Code	A priori Valence	Factor 1 (Negative): Women	Factor 1 (Negative): Men	Factor 2 (Non-Negative): Women	Factor 2 (Non-Negative): Men	Women M	Women SD	Women Min	Women Max	Men M	Men SD	Men Min	Men Max
Acceptance	+	−0.044	0.056	0.316	0.243	3.40	3.10	0	19	3.25	3.07	0	21
Constructive Problem Discussion	0	0.072	0.095	0.365	0.373	32.36	17.15	0	109	34.16	17.60	1	121
Distress-Maintaining Attributions	−	0.473	0.421	−0.016	0.020	1.27	1.91	0	15	0.83	1.47	0	13
Dysphoric Affect	−	0.085	0.051	−0.088	−0.059	0.28	1.90	0	53	0.03	0.26	0	8
Hostility	−	0.916	0.907	0.003	0.036	5.15	8.94	0	93	3.75	7.19	0	80
Humor	+	−0.047	−0.103	0.725	0.759	6.00	7.50	0	57	4.98	6.60	0	55
Psychological Abuse	−	0.231	0.172	−0.033	−0.019	0.02	0.21	0	5	0.03	0.27	0	6
Relationship-Enhancing Attributions	+	−0.055	−0.084	0.200	0.157	0.30	0.90	0	14	0.32	0.89	0	11
Self-Disclosure	+	−0.013	0.025	0.081	0.020	1.76	2.31	0	23	1.67	2.35	0	23
Withdrawal	−	0.099	0.135	−0.101	−0.061	0.05	0.47	0	9	0.06	0.58	0	11

Note. N = 2011 dyads. Factor loadings greater than .30 appear in bold. + = Positive, − = Negative, 0 = Neutral.

To determine the latent structure of the RMICS coding system, EFAs with principal axis factoring were conducted using 10 RMICS codes; the “other” code, capturing discussion unrelated to personal or relationship topics (e.g., discussion of the study itself), was excluded. Because the factors were not predicted to be orthogonal, an oblique rotation (i.e., promax) was used (Gorsuch, 1983). Principal axis factoring was selected because, as expected, observational coding data are non-normal, and, unlike maximum likelihood estimation, principal axis factoring does not assume multivariate normality. Because observational codes from partners interacting with each other are dependent and highly correlated, EFAs were conducted separately for men and women.
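For readers unfamiliar with the extraction step, the following is a minimal sketch of iterated principal axis factoring, assuming NumPy is available. It is not the analysis code used in this study: the function name `principal_axis_factoring` and its parameters are hypothetical, the correlation matrix is synthetic (built from made-up loadings, not RMICS data), and the promax rotation is omitted, so the returned loadings are unrotated.

```python
import numpy as np


def principal_axis_factoring(R, n_factors, max_iter=1000, tol=1e-8):
    """Unrotated principal axis factoring of a correlation matrix R.

    Returns (loadings, communalities). Rotation (e.g., promax) is not applied.
    """
    R = np.asarray(R, dtype=float)
    # Initial communality estimates: squared multiple correlations.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)  # replace the unit diagonal with communalities
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]  # largest eigenvalues first
        lam = np.clip(eigvals[top], 0.0, None)  # guard against tiny negative values
        loadings = eigvecs[:, top] * np.sqrt(lam)
        new_h2 = (loadings ** 2).sum(axis=1)
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    return loadings, h2


# Synthetic two-factor structure (illustrative values only).
true_loadings = np.array([
    [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
    [0.0, 0.7], [0.0, 0.6], [0.0, 0.5],
])
R = true_loadings @ true_loadings.T
np.fill_diagonal(R, 1.0)

loadings, communalities = principal_axis_factoring(R, n_factors=2)
```

In this synthetic case the estimated communalities should recover the squared loadings used to build the matrix. The eigenvalues of the reduced matrix are also the quantities that a scree plot displays.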

Results

The final sample included RMICS data from a total of 2011 couples. Table 1 contains demographic information for both women and men. On average, both women and men were in their mid-30s and had been in their relationship for 6–7 years. Almost three-quarters of the sample were married. The vast majority were White, non-Hispanic, and had completed at least some college.

Table 1

Variable	Women		Men
	M	SD	M	SD
Age (years)	34.20	8.00	36.13	8.76
Relationship length (years)	6.53	7.56	6.81	7.63
	n	Valid %	n	Valid %
Race
White	1368	86%	1353	85%
Non-White	227	14%	245	15%
Missing	416		413
Ethnicity
Hispanic/Latino/a	143	8%	150	8%
Non-Hispanic/Latino/a	1682	92%	1656	92%
Missing	186		205
Education level
High school or less	380	22%	502	30%
Some college or college graduate	983	58%	932	55%
At least some postgraduate education	337	20%	266	16%
Missing	311		311
Employment status
Not working	540	38%	126	9%
Working less than full-time	411	29%	60	4%
Working full-time	464	33%	1197	87%
Missing	596		628
Annual personal income
< $15,000	629	44%	126	9%
$15,000 – $35,000	385	27%	264	18%
> $35,000	420	29%	1090	74%
Missing	577		531
Annual family income
< $35,000	168	13%	150	11%
$35,000 – $75,000	505	39%	502	38%
> $75,000	618	48%	671	51%
Missing	720		688
Relationship status
Married or cohabiting	1460	95%	1462	95%
Other	74	5%	72	5%
Missing	477		477

Table 2 contains descriptive statistics for the RMICS codes for both women and men. Notably, several codes were extremely rare in these samples: psychological abuse, withdrawal, dysphoric affect, and relationship-enhancing attributions each occurred, on average, less than once per discussion for both women and men. For men, distress-maintaining attributions also occurred, on average, less than once per discussion. By far, the most frequent code for both members of the couple was constructive problem discussion.
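The rarity screen described above amounts to a simple threshold on the mean frequencies. The sketch below uses the means reported in Table 2; the function name `rare_codes` and the threshold parameter are illustrative assumptions, not part of the original analysis.

```python
# Mean code frequencies per discussion, copied from Table 2 as (women, men).
MEAN_FREQUENCY = {
    "Acceptance": (3.40, 3.25),
    "Constructive Problem Discussion": (32.36, 34.16),
    "Distress-Maintaining Attributions": (1.27, 0.83),
    "Dysphoric Affect": (0.28, 0.03),
    "Hostility": (5.15, 3.75),
    "Humor": (6.00, 4.98),
    "Psychological Abuse": (0.02, 0.03),
    "Relationship-Enhancing Attributions": (0.30, 0.32),
    "Self-Disclosure": (1.76, 1.67),
    "Withdrawal": (0.05, 0.06),
}


def rare_codes(threshold=1.0):
    """Return (rare_for_women, rare_for_men): codes whose mean is below threshold."""
    women = sorted(c for c, (w, _) in MEAN_FREQUENCY.items() if w < threshold)
    men = sorted(c for c, (_, m) in MEAN_FREQUENCY.items() if m < threshold)
    return women, men


rare_women, rare_men = rare_codes()
most_frequent = max(MEAN_FREQUENCY, key=lambda c: sum(MEAN_FREQUENCY[c]))
```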

The final EFA solutions were determined by (a) the leveling off of eigenvalues on the scree plots (Gorsuch, 1983), with additional consideration of (b) an efficient number of primary loadings (> .30) and (c) interpretable solutions. We retained two factors for both women (accounting for 30.88% of variance) and men (accounting for 29.53% of variance). For both women and men, Factor 1 (Negative) contained two items: distress-maintaining attributions and hostility. Factor 2 (Non-Negative) contained three items for women (constructive problem discussion, humor, and acceptance) but just two items for men (constructive problem discussion and humor). For both men and women, there were no cross-loadings greater than .30. The remaining codes from the RMICS coding system that were included in analyses (i.e., psychological abuse, withdrawal, dysphoric affect, relationship-enhancing attributions, and self-disclosure) did not have a factor loading greater than .30 on either of the factors, for either women or men. Factor 1 (Negative) and Factor 2 (Non-Negative) had very small correlations for both women (r = −.007) and men (r = .064).

We replicated the procedures dividing distressed and non-distressed couples, with nearly identical results. See Online Table 3 for code frequencies and Online Table 4 for factor loadings.

Finally, we tested whether any individual RMICS code provided incremental validity (i.e., a significantly higher association with relationship satisfaction) over the a priori negative, positive, and neutral code categories. (A priori categories were used, rather than strict adherence to the EFA categories, to ensure that even codes with modest factor loadings would still be included.) As shown in Online Table 5, (a) the most prevalent negative and positive codes and (b) their respective code categories had equivalent associations with satisfaction. All other categories had significantly higher associations (using Fisher’s r-to-z transformations) with satisfaction than did specific codes. Thus, splitting provided no incremental validity.
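The r-to-z comparison can be sketched as follows. This is a simplified illustration, not the exact procedure reported in Online Table 5: it uses the textbook formula for two independent samples, whereas correlations computed on the same couples are dependent and would call for a test designed for dependent correlations. The function names and the example correlations are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher's r-to-z transformation (inverse hyperbolic tangent)."""
    return math.atanh(r)


def compare_independent_correlations(r1, n1, r2, n2):
    """z statistic and two-sided p value for H0: rho1 == rho2.

    Assumes the two correlations come from independent samples,
    which is a simplification made for this sketch.
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p


# Hypothetical correlations with satisfaction: a lumped category vs. a single code.
z_stat, p_value = compare_independent_correlations(-0.30, 2011, -0.20, 2011)
```

With samples of this size, even a difference of .10 between two correlations is statistically detectable, which is why the equivalence of the most prevalent codes and their categories is informative.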

Discussion

The results of this study side heavily with the “lump” camp in the lump-versus-split dilemma in couple observational coding. Despite the RMICS being created as a distillation of the even more finely split MICS, across the samples combined here, nearly half of RMICS codes happened less than once per interaction. This infrequency may be due, in part, to the RMICS’s use (like most couple observational systems) of event-based coding, with the coding unit being the speaker turn. Although interval-based coding, in which the coding unit is more frequent (e.g., 5 s), may increase the frequency of these codes, it seems clear that they simply do not happen frequently enough to remain as stand-alone codes. Most researchers should combine RMICS codes into negative and non-negative codes, although splitting the non-negative factor into a priori neutral and positive codes would also be defensible.

This conclusion should be tempered by the fact that specialized codes are still needed to address specific research questions within individual studies. The studies included in our analyses represent a broad swath of investigations on couples’ communication. For this reason, some RMICS codes were likely important as indicators of specific communication behaviors within specialized areas of study. For example, psychological abuse (which overlaps heavily with SPAFF contempt, belligerence, and harsh criticism codes) is a very low-frequency code that is often dropped or collapsed into a negative composite category; however, it may be important to researchers who study intimate partner violence. Still, this and other low-frequency codes are essentially vestigial when included in a system for use with general samples of couples. Whether very rare codes happen enough in particular subpopulations and whether coders can achieve sufficient agreement for them are questions for individual investigators; in general, however, such codes are probably not worth retaining in general coding systems and likely are still too rare to be worth including even in specialized investigations.

Such low frequencies of communication behaviors globally make lumping necessary when implementing a validated coding scheme, but ironically also make empirically guided lumping (via EFA) harder. Nevertheless, in our analyses, the EFAs produced similar solutions for men and women: negative (defined by hostility and distress-maintaining attributions) and non-negative (defined by constructive problem discussion and humor, with the addition of acceptance for women). This is consistent with the implicit hypotheses in system creation to which Bakeman and Gottman (1997) were referring. That is, behaviors naturally fall into negative and neutral/positive constructs, even if the ways of combining them have been maddeningly inconsistent (Heyman, 2001). This also aligns with the general negative and positive emotional valences other disciplines have found in processing verbal and nonverbal communication and in studying emotion (e.g., Humbad et al., 2011; Watson et al., 1988; Williamson et al., 2011).

Our results are similar to earlier EFA studies (Heyman, Eddy, et al., 1995; Jacob & Krahn, 1987; Remen et al., 2000; Williamson et al., 2011) in finding a negative factor, which likely has to do with the frequency and salience of negativity during conflict (Gottman, 1979). Our results differ in that our second factor contained both neutral and positive behaviors, whereas others have sometimes found separate positive factors. Neutral problem discussion is by far RMICS’ most frequent code, leaving infrequent and scattered positive behavior that may not converge as well.

Men’s and women’s EFA loadings were similar. Acceptance had a slightly higher loading for women (.316) than for men (.243), which cannot be explained by a notable difference in frequencies, which were very close. For women, this put Acceptance above the .30 factor loading interpretation threshold. Given the sample size, this slight difference in loadings is likely replicable. However, given the similar ordinal loadings for all codes on both factors for men and women, it does not appear to be of considerable practical import.

There are several notable strengths of this paper. First, it is the largest study using EFA with couples coding, including over 2000 couples. Second, it used data from 14 separate studies from 11 different laboratories, ensuring great breadth in data collection methods, research questions, geographic regions within the U.S., and sampling frames. Third, all coding was done at a single laboratory, limiting variability due to coding implementation, but still maintaining the variability of four coding trainers and dozens of coders over a decade. Fourth, the combined sample included a notable percentage of partners who had a high school education or less (30% for men and 22% for women; see Table 1), although this is still lower than the U.S. generally (40% and 37%, respectively; U.S. Census Bureau, 2020).

Several limitations should also be noted. First, the omnibus data set is younger and has lower proportions of people of color and cohabiting couples than the corresponding U.S. population of committed mixed-sex couples. Replication in more generalizable samples, as well as those solely from particular non-White subpopulations and same-sex couples, is needed. Second, all of the studies were within the U.S. and may not generalize to other English-speaking countries or other parts of the world. Third, the results are from a single coding system, the RMICS. Although the RMICS is similar to several other coding systems (Kerig & Baucom, 2004) and the EFA results are similar to those of other systems, the results still do not fully translate to all microanalytic systems, and EFAs with those systems, especially the SPAFF, are needed (see Sommer, 2017). Fourth, to conclusively test the inference that broad negative, neutral, and positive (or non-negative) codes would provide incremental validity improvements to the RMICS, one coding team would have to code a data set with the reduced codes and another team use the original RMICS, because construct-level scores coded directly may differ from combining underlying RMICS codes. Finally, although a major strength of this study was the large dataset aggregated across studies with different research questions, this may have obscured important variability. Previous research has shown that the context of the interaction (e.g., speaker motivation, potential for a problem to be resolved) can be important in the interpretation of specific behaviors (Overall & McNulty, 2017). For example, negative communication behaviors, such as distress-maintaining attributions, may actually produce positive outcomes when they point out important areas that both partners are willing to change. As such, we echo the need for well-designed studies and rigorous methods that can parse and analyze valid observed data given potential contextual issues and covarying constructs within the couple, as well as potential biases in the coding team (Baucom et al., 2017).

Implications and Conclusions

Given the resources needed to implement a communication coding system with good reliability, validity, and inter-rater agreement, it makes sense to maintain a more parsimonious scheme. Coding often takes months of training, which can be costly in terms of time, money, or both, and more nuanced or infrequent codes (a) often lead to coders struggling to achieve reliability with (b) no apparent payoff in incremental validity (at least related to the most commonly studied outcome, relationship satisfaction). Although technological developments hold promise for automated coding in the future, it is unlikely that, at least in the near future, even automated coding systems developed with machine learning will be able to achieve good specificity and sensitivity for infrequent codes without expensive training (Reblin et al., 2016, 2018). Another option is the use of global coding systems, which have a long tradition in the field (e.g., Kerig & Baucom, 2004). Microanalytic systems may be preferable when particular sequences are important, but many behaviorally focused research questions can be reliably and validly studied with global coding, with the added advantage that some constructs (e.g., progress in problem-solving) can only be operationalized via global approaches.

Researchers need to be cautious about integrating purpose-driven codes into more global communication coding schemes. We often teach our students that the tools used in research methodology should best reflect the research question. Although specific low-frequency communication behaviors can be important indicators of psychosocial processes, these behaviors should be split out only for hypothesis-driven investigations of those processes, and they would need to occur frequently enough in the video-recorded setting to make their inclusion worthwhile. More general couple communication research that fits within the tradition of the analogue couple conflict discussion paradigm should refrain from using valuable coding resources to capture more fine-grained behaviors, which almost invariably are combined or discarded. Furthermore, a more uniform, simplified coding system offers the opportunity to make comparisons across datasets and leverage multiple small datasets, which often cannot be accomplished when micro-behaviors are not consistently combined across projects.

Whither couple observational coding? Convergence of factor analysis results implies that the microanalytic “splitting” era in couples coding should draw to a close, with future studies instead focused on negative, neutral, and positive codes.