Which economic, regional, and demographic characteristics explain differences in average state wages?
This study examines how demographic composition, labor market conditions, and region relate to economic outcomes across U.S. states. Using state-level data from the American Community Survey (ACS) spanning 2000–2023, found here: LINK, this analysis focuses on two key measures of economic performance: average hourly wages and average household income. State-year observations were created by computing weighted averages of individual-level ACS data, capturing variation in employment status, age, gender, race, citizenship, and labor force participation.
To examine these relationships, I used multiple linear regression models that relate regional and labor market characteristics to wages and income while controlling for time trends and demographic characteristics. The results show that employment rates, region, and time are among the strongest predictors of both hourly wages and household income. Demographic factors such as age structure, gender composition, and race also play meaningful roles, though their effects are generally smaller than those of labor market conditions. Significant regional differences remain even after accounting for demographics and employment, with Northeastern states, and to a lesser extent Western states, having higher wages and income levels than the Midwest and South.
Overall, the findings highlight the importance of labor market strength and regional economic structure in shaping economic outcomes in U.S. states, offering insight and further analysis relevant to labor economics and public policy.
How do demographic characteristics influence hourly wages and household income across U.S. states from 2000 to 2023?
How strongly do employment and unemployment rates predict state-level hourly wages and household income after accounting for demographic characteristics?
Do U.S. regions differ significantly in hourly wages and household income, even after controlling for demographic and employment characteristics?
States in the U.S. show large differences in wages, incomes, and labor market outcomes. At the same time, states differ in their demographic makeup and in how many residents are employed, unemployed, or out of the labor force. These differences raise important questions about what drives economic success at the state level.
Wages vary across states for several reasons. States with higher concentrations of high-productivity industries such as technology, finance, or infrastructure tend to pay higher wages than states that rely more on agriculture, manufacturing, or low-wage services. Differences in education levels, cost of living, and state policies also contribute to wage gaps. Similar differences appear across U.S. regions, where economic and cultural norms have developed differently. Labor economics research shows that more highly educated workforces generally earn more and experience stronger long-term wage growth.
Demographics further shape state economies. Characteristics such as age, gender, race, and citizenship affect labor-force participation and income. For example, older populations often have higher income due to accumulated wealth and greater experience, while racial and citizenship disparities can reflect inequalities documented in labor-market studies. Because economic structure and population characteristics vary widely across states, analyzing how these factors relate to wages provides insight into regional inequality and the conditions that support stronger labor-market performance.
The distribution of average hourly wages appears approximately normal, centered around $14–$16 per hour, with most state-year observations falling in the $12–$20 range. There are no extreme outliers or heavy tails, which suggests that linear regression with hourly wages as the response variable is appropriate. The distribution of average household income appears right-skewed, with most observations in the $60,000–$100,000 range. This skew reflects income inequality across states: a small number of state-years have much higher average incomes than the rest. Because this variable is not normally distributed, I log-transform it for the regression analysis.
The scatterplot of average wage against employment rate shows a negative linear trend, which is surprising. This could indicate that states with higher employment rates have more low-wage jobs, and that employment rate alone may not be a strong predictor of wages.
Turning to demographic relationships with wages, average age is positively related to wages in the plot. This is consistent with labor economics theory, in which older workers have more experience and therefore higher earnings. Age could be a significant predictor in the regression models. The plot of wages against the white population share shows a slightly negative relationship. This could reflect regional patterns, as some predominantly white states could have lower wages (such as southern states). The relationship between the Black population share and wages shows no clear trend, suggesting that this demographic alone does not predict wages.
Over time, average wages and average household income have increased across states. This is consistent with inflation and rising productivity.
The data used in this study come from the ACS (American Community Survey), spanning the years 2000 through 2023, and consist of individual-level observations identified by state and year.
Because the original data are at the individual level, I computed weighted averages by state-year, using the survey person weights, for all of the demographic, employment, and economic outcome (wages and income) variables. The resulting dataset contains one observation per state-year, with each share variable giving the weighted proportion of the population in that category.
To capture regional differences, U.S. states were classified into four different regions: Northeast, Midwest, South, and West, using FIPS state codes. This allowed for analysis of geographic differences in wage and income after controlling for demographic and labor market characteristics.
Here are variable descriptions:
YEAR: calendar year (2000–2023)
STATEFIP: numeric state identifier
avg_hourly_wage: mean of individual hourly wages
median_wage: median hourly wage
avg_household_income: mean of household income
avg_age: mean age
pct_female: share of respondents who are female (stored as a proportion from 0 to 1)
pct_citizen: share who are U.S. citizens (proportion)
pct_white and pct_black: shares of the population identified as white or Black (proportions)
pct_employed, pct_unemployed, pct_not_in_labor_force: shares in each labor-force status category (proportions)
sample_size: number of observations used in that state-year
region: region of state either being Northeast, Midwest, South, or West
Call:
lm(formula = avg_hourly_wage ~ pct_employed + pct_unemployed +
    pct_female + pct_white + pct_black + avg_age + pct_citizen +
    YEAR + region, data = state_summary)
Residuals:
    Min      1Q  Median      3Q     Max
-4.1807 -0.9175 -0.0705  0.8521  5.6599
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -7.478e+02  1.826e+01 -40.942  < 2e-16 ***
pct_employed    1.071e+01  1.253e+00   8.548  < 2e-16 ***
pct_unemployed  1.804e+01  3.451e+00   5.227 2.03e-07 ***
pct_female     -3.664e+01  7.115e+00  -5.149 3.05e-07 ***
pct_white      -6.883e-01  4.282e-01  -1.607   0.1083
pct_black       4.315e+00  7.158e-01   6.029 2.19e-09 ***
avg_age         1.495e-01  3.232e-02   4.625 4.15e-06 ***
pct_citizen    -5.039e+00  7.365e-01  -6.842 1.23e-11 ***
YEAR            3.852e-01  8.433e-03  45.682  < 2e-16 ***
regionMidwest  -1.209e+00  1.454e-01  -8.314 2.45e-16 ***
regionSouth    -1.455e+00  1.614e-01  -9.018  < 2e-16 ***
regionWest     -2.788e-01  1.668e-01  -1.672   0.0948 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.345 on 1212 degrees of freedom
(102 observations deleted due to missingness)
Multiple R-squared: 0.8248, Adjusted R-squared: 0.8232
F-statistic: 518.8 on 11 and 1212 DF, p-value: < 2.2e-16
Call:
lm(formula = log(avg_household_income) ~ pct_employed + pct_unemployed +
    pct_female + pct_white + pct_black + avg_age + pct_citizen +
    YEAR + region, data = state_summary)
Residuals:
      Min        1Q    Median        3Q       Max
-0.250929 -0.066458 -0.001841  0.063644  0.295075
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -5.377e+01  1.143e+00 -47.041  < 2e-16 ***
pct_employed    2.068e+00  8.101e-02  25.521  < 2e-16 ***
pct_unemployed  9.562e-01  2.286e-01   4.183 3.07e-05 ***
pct_female     -3.986e-01  4.557e-01  -0.875   0.3818
pct_white      -1.814e-01  2.783e-02  -6.520 9.99e-11 ***
pct_black       1.860e-01  4.655e-02   3.996 6.79e-05 ***
avg_age         3.659e-03  2.089e-03   1.751   0.0801 .
pct_citizen    -9.240e-01  4.764e-02 -19.396  < 2e-16 ***
YEAR            3.227e-02  5.278e-04  61.137  < 2e-16 ***
regionMidwest  -1.191e-01  9.502e-03 -12.531  < 2e-16 ***
regionSouth    -1.408e-01  1.051e-02 -13.395  < 2e-16 ***
regionWest     -9.889e-02  1.086e-02  -9.103  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09107 on 1314 degrees of freedom
Multiple R-squared: 0.8947, Adjusted R-squared: 0.8938
F-statistic: 1015 on 11 and 1314 DF, p-value: < 2.2e-16
To examine the relationship between economic outcomes and demographics, employment status, and region, this study uses multiple linear regression models. The two dependent variables are average hourly wages and average household income by state-year. Model 1 uses average hourly wages as the dependent variable, while Model 2 uses the natural log of average household income, both because the EDA showed that income is right-skewed and because the log scale lets coefficients be read as approximate percent changes.
Both models include labor market measures (the employment and unemployment rates) as well as demographic controls for each state's age, gender, race, and citizenship composition. A linear time trend (YEAR) is included to control for changes in wages and income over time, such as the increases due to inflation shown in the EDA. Regional differences are captured by indicators for the Midwest, South, and West, with the Northeast as the reference category in both models.
Data cleaning and variable construction were performed mainly using the tidyverse package. The regression models and diagnostics were created using base R functions, with variance inflation factors computed using the car package.
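Model 1 examines how demographic, employment, and regional characteristics explain average hourly wages across U.S. states. Because the share variables are measured as proportions (0 to 1), each coefficient is divided by 100 to express the effect of a 1 percentage point change. The employment rate has a positive effect: its coefficient of 10.71 indicates that a 1 percentage point increase in a state’s employment rate is associated with an increase of about $0.11 in the average hourly wage, holding all other variables constant. The unemployment rate also carries a positive coefficient of 18.04, or about $0.18 per percentage point. This could mean that states with higher unemployment raise wages to attract or retain workers, but because the employed, unemployed, and not-in-labor-force shares sum to one, the coefficient is measured relative to the not-in-labor-force share and may instead reflect higher labor force participation.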
Regional differences are also significant. Relative to the Northeast (the omitted category), states in the Midwest have hourly wages that are about $1.21 lower, and states in the South about $1.46 lower, even after controlling for demographics, labor-force characteristics, and year. The West shows the smallest gap, about $0.28 below the Northeast, and this difference is not statistically significant at the 5% level (p = 0.095), implying that Western states are close to Northeastern wage levels once the controls are accounted for. These findings indicate that there are meaningful regional wage disparities independent of workforce composition and demographic makeup. Overall, the Northeast and West have higher average hourly wages than the Midwest and South.
Several demographic characteristics also contribute to differences in state-level wages. The female share of the state population has a significant negative coefficient of –36.64, implying that a 1 percentage point increase in the female share is associated with roughly a $0.37 decrease in average wages, consistent with gender wage disparities. The Black population share has a significant positive coefficient (4.32), meaning a 1 percentage point increase corresponds to about a $0.04 increase in wages, while the white population share is not statistically significant. Average age has a positive effect: each additional year of average age predicts a $0.15 increase in the average hourly wage.
Overall, with an R-squared of 0.825, Model 1 explains a large share of the variation in hourly wages across U.S. states from 2000–2023. Employment conditions and region emerge as the most influential drivers of wage differences, supplemented by other meaningful demographic effects.
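The per-percentage-point figures for the share variables follow from a simple rescaling. A minimal sketch, assuming the shares are stored as proportions between 0 and 1 and using coefficients copied from the Model 1 output (the object names `coefs` and `per_point` are illustrative only):

```r
# Model 1 coefficients for the share variables, copied from the summary output.
# Assumption: each share is a proportion (0 to 1), so a 1 percentage point
# change corresponds to a change of 0.01 in the predictor.
coefs <- c(pct_employed = 10.71, pct_unemployed = 18.04,
           pct_female = -36.64, pct_black = 4.315)

# Dollar change in the average hourly wage per 1 percentage point change
per_point <- coefs * 0.01
round(per_point, 2)
```

Rounded to cents, these values give the $0.11, $0.37, and $0.04 magnitudes used in the interpretation of Model 1.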
Model 2 uses the natural log of household income as the dependent variable, meaning the coefficients can be interpreted as approximate percentage changes. This model fits extremely well, with an R-squared of 0.895, indicating that employment, demographics, regional factors, and time trends collectively explain 89.5% of the variation in household income.
The coefficient on the employment rate of 2.068 implies that a 1 percentage point increase in the employment rate is associated with about a 2.1% increase in average household income, holding other factors constant. Unemployment also predicts higher income, with a coefficient of 0.956 (about 1.0% per percentage point), suggesting that income tends to be higher in periods or states where unemployment is elevated, which is again consistent with the results from Model 1.
Regional effects are highly significant in this model as well. Compared to the Northeast, household incomes are about 11.9% lower in the Midwest, 14.1% lower in the South, and 9.9% lower in the West, even after controlling for demographics, employment conditions, and time. This shows that regional income gaps are not simply a function of workforce makeup or labor-market strength.
Demographic characteristics also shape household income. The white population share has a significant negative coefficient of –0.181, meaning a 1 percentage point increase in the white share predicts a 0.18% decrease in income, consistent with the pattern in the EDA and Model 1 that some of the highest-income states are more racially diverse. The Black population share shows the opposite pattern: each percentage point increase is associated with a 0.19% increase in income. The female share is not statistically significant, and average age has a positive coefficient that is only marginally significant (p = 0.08), giving weak evidence that states with older populations tend to have higher incomes.
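The regional percentages above read the log-scale coefficients directly as percent changes, which is an approximation. A short sketch of the exact conversion, using the region coefficients from the Model 2 output (the object names are illustrative, not part of the original analysis):

```r
# Region coefficients from Model 2 (log of average household income),
# measured relative to the Northeast reference category
log_coefs <- c(Midwest = -0.1191, South = -0.1408, West = -0.09889)

# Approximation used in the text: read the coefficient as a percent change
approx_pct <- 100 * log_coefs

# Exact percent difference implied by a log-linear model: 100 * (exp(b) - 1)
exact_pct <- 100 * (exp(log_coefs) - 1)

round(rbind(approx_pct, exact_pct), 1)
```

For negative coefficients of this size the exact gaps are somewhat smaller in magnitude than the approximate ones, but the ordering of the regions is unchanged.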
The year variable is very strong: its coefficient of 0.032 implies that income increases by roughly 3.2% per year, capturing long-term growth from inflation, productivity, and the rising cost of living.
Model 2 shows that household income is shaped by labor markets, sustained income growth over time, and meaningful regional differences. Demographic factors play a smaller but still important role in explaining state-level variation in household income.
Model 1
                   GVIF Df GVIF^(1/(2*Df))
pct_employed   2.248762  1        1.499587
pct_unemployed 1.463316  1        1.209676
pct_female     2.594512  1        1.610749
pct_white      2.379009  1        1.542404
pct_black      4.304260  1        2.074671
avg_age        2.728276  1        1.651749
pct_citizen    1.589988  1        1.260947
YEAR           2.305429  1        1.518364
region         6.703820  3        1.373158
Model 2
                   GVIF Df GVIF^(1/(2*Df))
pct_employed   2.177727  1        1.475712
pct_unemployed 1.521360  1        1.233434
pct_female     2.592841  1        1.610230
pct_white      2.373192  1        1.540517
pct_black      4.258106  1        2.063518
avg_age        2.887566  1        1.699284
pct_citizen    1.601194  1        1.265383
YEAR           2.505556  1        1.582895
region         6.685219  3        1.372522
Model 1
The diagnostic plots for Model 1 show that the regression fits reasonably well, though some assumptions are only partially met. In the Residuals vs. Fitted plot, the smoothed line stays close to horizontal, supporting a linear relationship between the predictors and hourly wages. The line in the Scale–Location plot is also roughly flat, satisfying the constant-variance condition. In the Q–Q plot, most residuals follow the normal line, with only a slight deviation in the upper tail that raises a minor question about normality. Finally, the Residuals vs. Leverage plot reveals no influential outliers, as no points approach high Cook’s distance values. Overall, Model 1 performs adequately, but the diagnostics suggest a very small departure from normality that should be noted when interpreting results.
Model 2
The diagnostic plots for Model 2 indicate that the log transformation of household income substantially improved the model. The Residuals vs. Fitted plot shows no meaningful curvature, suggesting that the condition of linearity is met. The Scale–Location plot demonstrates constant variance across fitted values, while the Q–Q plot shows residuals closely following the normal line with no deviation at the tails, meeting the condition of normality. The Residuals vs. Leverage plot also shows no outliers or influential points beyond the Cook’s distance contours. Together, these diagnostics show that Model 2 meets the regression assumptions more fully than Model 1 and fits state-level average household income well.
Multicollinearity
To evaluate multicollinearity among the predictors, I used generalized VIF (GVIF) values because the models include a categorical predictor (region) with more than one degree of freedom. The adjusted value, GVIF^(1/(2*Df)), puts predictors with different degrees of freedom on a comparable scale.
The results for Model 1 show no evidence of harmful multicollinearity: every raw GVIF for a single-degree-of-freedom predictor is below 5, and every adjusted GVIF is at or below 2.07. The employment variables central to the model, pct_employed (adjusted GVIF 1.50) and pct_unemployed (1.21), display very low multicollinearity, meaning their effects can be interpreted reliably. The demographic variables pct_black (2.07), pct_white (1.54), pct_female (1.61), and avg_age (1.65) also display low levels of multicollinearity. The region factor shows a low adjusted GVIF of 1.37, indicating it is not strongly correlated with demographics or employment status. Overall, Model 1’s predictors are sufficiently independent, and multicollinearity does not threaten the validity of this model.
Model 2 shows a nearly identical multicollinearity pattern, again indicating no problematic overlap among predictors. The employment variables pct_employed (1.48) and pct_unemployed (1.23) show low adjusted GVIFs. The demographic predictors pct_black (2.06), pct_female (1.61), pct_white (1.54), and avg_age (1.70) also remain low. The region factor again shows a low adjusted GVIF (1.37), indicating that region is not strongly collinear with the other variables. Because every adjusted GVIF is at or below about 2.1, Model 2’s coefficients can be interpreted with confidence, and multicollinearity is not a concern for Model 2.
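The adjustment can be reproduced by hand from the GVIF and Df columns of the VIF tables. A minimal sketch using two rows of the Model 1 table (the vectors `gvif` and `df` are typed in here for illustration rather than extracted from a fitted model):

```r
# GVIF and degrees of freedom for two predictors, copied from the Model 1 table
gvif <- c(pct_black = 4.304260, region = 6.703820)
df   <- c(pct_black = 1,        region = 3)

# Adjusted GVIF: GVIF^(1 / (2 * Df)).
# For a 1-df predictor this is just the square root of its VIF.
adj_gvif <- gvif^(1 / (2 * df))
round(adj_gvif, 3)

# Squaring the adjusted value returns it to the usual VIF scale,
# which is the scale that rule-of-thumb cutoffs such as 5 refer to.
round(adj_gvif^2, 3)
```

On the squared scale, region comes out below 2 even though its raw GVIF is 6.70, which is why the three-degree-of-freedom factor is not flagged.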
This study finds clear differences in economic outcomes across U.S. states that are tied to labor market conditions, demographic composition, and regional location. One of the main findings is the strong association between labor market conditions (employment and unemployment rates) and economic performance (wages and income). States with higher proportions of employed individuals tend to exhibit higher average wages and household incomes, underscoring the importance of labor force participation for state economic performance.
Geographic location also emerges as a key factor shaping economic outcomes. States in the Northeast and West generally display higher wages and household incomes compared to those in the Midwest and South. These differences persist after controlling for year, demographics, and employment, and may reflect structural factors such as industry concentration, urbanization, and regional economic development. The results suggest that regional economic advantages continue to play an important role in shaping opportunities and living standards across the country.
Demographic patterns provide additional context for these economic differences. States with older average populations tend to have higher incomes, as older workers have more work experience and have progressed further in their careers. Variation in gender composition, racial composition, and citizenship rates is also associated with differences in economic outcomes, reflecting broader patterns of migration, workforce composition, and labor market sorting.
Overall, the findings emphasize that strong labor markets and regional factors are central drivers of wage and income disparities across states. These insights have important implications for policymakers seeking to promote economic growth, reduce inequality, and improve labor market outcomes at the state level.
This study has several limitations. Because the analysis uses one observation per state-year, it cannot capture differences within states, such as variation between urban and rural areas. The results describe associations rather than causal effects, as outside factors such as state policies, cost of living, and local industry mix may also influence wages and income. Including these variables in future studies would help capture the full picture. In addition, the data span only 2000–2023, which leaves out earlier historical trends that could be relevant to these relationships. Finally, the regional groupings are broad and may mask important differences among states within the same region. Using finer regional groupings would give better results in future studies.
IPUMS USA. “U.S. Census Data for Social, Economic, and Health Research.” usa.ipums.org/usa/. Accessed 10 Dec. 2025.
U.S. Census Bureau. “Census Regions and Divisions of the United States.” 2023, www.census.gov/.
---
title: "US State Wage and Income Analysis"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "darkblue"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{=html}
<head>
<base target="_blank">
</head>
```
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
library(pacman)
library(DataExplorer)
library(car)
library(maps)
pacman::p_load(data.table, tidyverse)
# Use the fread function in the package data.table for a large dataset
df <- fread("~/Documents/mth369/cps_00004.csv")
glimpse(df)
# Clean data
df_clean <- df %>%
  mutate(
    HOURWAGE = ifelse(HOURWAGE >= 999, NA, HOURWAGE),
    HHINCOME = ifelse(HHINCOME >= 9999999, NA, HHINCOME),
    AGE = ifelse(AGE > 90, NA, AGE),
    SEX = factor(SEX, levels = c(1, 2), labels = c("Male", "Female")),
    RACE = factor(RACE),
    CITIZEN = factor(CITIZEN),
    EMPSTAT = factor(EMPSTAT)
  )
df_clean <- df_clean %>%
  mutate(
    emp_group = case_when(
      EMPSTAT %in% c(10, 12) ~ "employed",
      EMPSTAT %in% c(21, 22) ~ "unemployed",
      EMPSTAT %in% c(32, 34, 36) ~ "not_in_labor_force",
      TRUE ~ NA_character_
    )
  )
# Aggregate by state-year (using person weight ASECWT)
state_summary <- df_clean %>%
  group_by(YEAR, STATEFIP) %>%
  summarize(
    avg_hourly_wage = weighted.mean(HOURWAGE, ASECWT, na.rm = TRUE),
    median_wage = median(HOURWAGE, na.rm = TRUE),
    avg_household_income = weighted.mean(HHINCOME, ASECWT, na.rm = TRUE),
    avg_age = weighted.mean(AGE, ASECWT, na.rm = TRUE),
    pct_female = weighted.mean(SEX == "Female", ASECWT, na.rm = TRUE),
    pct_citizen = weighted.mean(CITIZEN == 1, ASECWT, na.rm = TRUE),
    pct_white = weighted.mean(RACE == 100, ASECWT, na.rm = TRUE),
    pct_black = weighted.mean(RACE == 200, ASECWT, na.rm = TRUE),
    pct_employed = weighted.mean(emp_group == "employed", ASECWT, na.rm = TRUE),
    pct_unemployed = weighted.mean(emp_group == "unemployed", ASECWT, na.rm = TRUE),
    pct_not_in_labor_force = weighted.mean(emp_group == "not_in_labor_force", ASECWT, na.rm = TRUE),
    sample_size = n() # optional: count of people in that state-year
  ) %>%
  ungroup()
glimpse(state_summary)
state_summary <- state_summary %>%
  mutate(
    region = case_when(
      # Northeast
      STATEFIP %in% c(9, 23, 25, 33, 44, 50, 34, 36, 42) ~ "Northeast",
      # Midwest
      STATEFIP %in% c(17, 18, 26, 39, 55, 19, 20, 27, 29, 31, 38, 46) ~ "Midwest",
      # South (includes DC)
      STATEFIP %in% c(1, 21, 28, 47, 5, 22, 40, 48,
                      10, 11, 12, 13, 24, 37, 45, 51, 54) ~ "South",
      # West
      STATEFIP %in% c(4, 8, 16, 30, 32, 35, 49, 56,
                      2, 6, 15, 41, 53) ~ "West",
      TRUE ~ NA_character_
    ),
    region = factor(region, levels = c("Northeast", "Midwest", "South", "West"))
  )
```
Introduction
===
Column {data-width=650}
---
### Title and Abstract
**Which economic, regional, and demographic characteristics explain differences in average state wages?**
This study examines how demographic composition, labor market conditions, and region relate to economic outcomes across U.S. states. Using state-level data from the American Community Survey (ACS) spanning 2000–2023, found here: [LINK](https://usa.ipums.org/usa/), this analysis focuses on two key measures of economic performance: average hourly wages and average household income. State-year observations were created by computing weighted averages of individual-level ACS data, capturing variation in employment status, age, gender, race, citizenship, and labor force participation.
To examine these relationships, I used multiple linear regression models that relate regional and labor market characteristics to wages and income while controlling for time trends and demographic characteristics. The results show that employment rates, region, and time are among the strongest predictors of both hourly wages and household income. Demographic factors such as age structure, gender composition, and race also play meaningful roles, though their effects are generally smaller than those of labor market conditions. Significant regional differences remain even after accounting for demographics and employment, with Northeastern states, and to a lesser extent Western states, having higher wages and income levels than the Midwest and South.
Overall, the findings highlight the importance of labor market strength and regional economic structure in shaping economic outcomes in U.S. states, offering insight and further analysis relevant to labor economics and public policy.
### Research Questions
1. How do demographic characteristics influence hourly wages and household income across U.S. states from 2000 to 2023?
2. How strongly do employment and unemployment rates predict state-level hourly wages and household income after accounting for demographic characteristics?
3. Do U.S. regions differ significantly in hourly wages and household income, even after controlling for demographic and employment characteristics?
Column {data-width=350}
---
### Background and Significance
States in the U.S. show large differences in wages, incomes, and labor market outcomes. At the same time, states differ in their demographic makeup and in how many residents are employed, unemployed, or out of the labor force. These differences raise important questions about what drives economic success at the state level.
Wages vary across states for several reasons. States with higher concentrations of high-productivity industries such as technology, finance, or infrastructure tend to pay higher wages than states that rely more on agriculture, manufacturing, or low-wage services. Differences in education levels, cost of living, and state policies also contribute to wage gaps. Similar differences appear across U.S. regions, where economic and cultural norms have developed differently. Labor economics research shows that more highly educated workforces generally earn more and experience stronger long-term wage growth.
Demographics further shape state economies. Characteristics such as age, gender, race, and citizenship affect labor-force participation and income. For example, older populations often have higher income due to accumulated wealth and greater experience, while racial and citizenship disparities can reflect inequalities documented in labor-market studies. Because economic structure and population characteristics vary widely across states, analyzing how these factors relate to wages provides insight into regional inequality and the conditions that support stronger labor-market performance.
Data
===
```{r}
DT::datatable(state_summary, rownames = FALSE, options = list(
  columnDefs = list(list(className = 'dt-center', targets = 1:5)),
  pageLength = 10))
```
EDA
===
Column {.tabset data-width=700}
---
### Hourly Wage Distribution
```{r}
ggplot(state_summary, aes(x = avg_hourly_wage)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "blue") +
  labs(title = "Distribution of Average Hourly Wages",
       x = "Average Hourly Wage", y = "Count")
```
### Household Income Distribution
```{r}
ggplot(state_summary, aes(x = avg_household_income)) +
  geom_histogram(binwidth = 5000, color = "black", fill = "blue") +
  labs(title = "Distribution of Average Household Income",
       x = "Average Household Income", y = "Count")
```
### Average Wage and Employment Rate
```{r}
ggplot(state_summary, aes(x = pct_employed, y = avg_hourly_wage)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Employment Rate and Average Wages",
       x = "Percent Employed", y = "Average Hourly Wage")
```
### Age vs. Wages
```{r}
ggplot(state_summary, aes(x = avg_age, y = avg_hourly_wage)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Relationship Between Average Age and Wages",
       x = "Average Age", y = "Average Hourly Wage")
```
### White Race % Vs. Wages
```{r}
ggplot(state_summary, aes(x = pct_white, y = avg_hourly_wage)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Percent White and Wages",
       x = "Percent White", y = "Average Hourly Wage")
```
### Black Race % Vs. Wages
```{r}
ggplot(state_summary, aes(x = pct_black, y = avg_hourly_wage)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Percent Black and Wages",
       x = "Percent Black", y = "Average Hourly Wage")
```
### Average Wages Over Time
```{r}
state_summary %>%
  group_by(YEAR) %>%
  summarize(mean_wage = mean(avg_hourly_wage, na.rm = TRUE)) %>%
  ggplot(aes(x = YEAR, y = mean_wage)) +
  geom_line() +
  labs(title = "Average State Hourly Wages Over Time",
       x = "Year", y = "Average hourly wage")
```
### Average Household Income Over Time
```{r}
state_summary %>%
  group_by(YEAR) %>%
  summarize(mean_income = mean(avg_household_income, na.rm = TRUE)) %>%
  ggplot(aes(x = YEAR, y = mean_income)) +
  # linewidth replaces the size argument for lines in ggplot2 3.4.0 and later
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Average Household Income Over Time",
       x = "Year",
       y = "Average Household Income")
```
Column {.tabset data-width=300}
---
### Analysis
The distribution of hourly wages appears to be normally distributed, centered around $14-$16 dollars per hour. Most states had an average hourly wage of around $12-$20 as most observations fall in this range. There are also no extreme outliers or tails that suggests linear regression using hourly wages as the response variable is appropriate. The distribution of average household income appears to be right skewed, with the most observations being in the $60,000-$100,000 range. This skew shows that there are income inequalities across states. With this not being normally distributed, I will transform this model by logging this variable for regression analysis.
The scatterplot of average wage and employment rate shows a negative linear trend, which is surprising. This could indicate that states with higher employment rates could have more lower wage jobs and that employment rate may not be a strong predictor in the regression model.
Now looking into demographic relationships to wages and income, age is positively related to wages shown through the plot. This is due to older workers having more experience which causes higher earnings, following labor economics theory. Age could be a significant predictor in the regression models. When looking into wages and the white population in states, there was a slightly negative relationship. This could reflect different regional patterns as some predominately white states could have lower wages (such as southern states). The relationship between black populations and wages appears to have no trend in the plot. This shows that this demographic alone does not predict wages.
Average wages and average household income have both increased over time. This is consistent with inflation and rising productivity, as economic theory would predict.
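The reasoning behind logging household income can be illustrated with a small simulation. This is only a sketch on made-up data: `income_toy` and the `skewness()` helper are hypothetical and are not part of the study's code.

```r
set.seed(42)

# Simulated right-skewed "income" values (hypothetical, for illustration only)
income_toy <- rlnorm(500, meanlog = log(80000), sdlog = 0.35)

# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(income_toy)       # clearly positive: long right tail
skew_log <- skewness(log(income_toy))  # close to zero after logging

c(raw = skew_raw, logged = skew_log)
```

A skewness near zero after the transformation is the pattern that makes the logged variable a better fit for linear regression.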
### Data Description and Variable Information
The data used in this study come from the American Community Survey (ACS) and span 2000 through 2023, with individual-level observations identified by state and year.
Because the original data are at the individual level, I computed weighted averages by state-year for all of the demographic, employment, and economic outcome (wage and income) variables. The resulting dataset contains one observation per state-year, giving the population percentage in each demographic and labor-force category.
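The aggregation step described above might look roughly like the following. This is a minimal sketch on made-up data, not the code used for the study: the person-weight column `PERWT`, the individual-level columns `hourly_wage` and `employed`, and the data frame `acs_toy` are all assumptions.

```r
library(dplyr)

# Toy individual-level data (hypothetical values, for illustration only)
acs_toy <- tibble::tribble(
  ~YEAR, ~STATEFIP, ~PERWT, ~hourly_wage, ~employed,
  2023,  39,        100,    20,           1,
  2023,  39,        300,    30,           0,
  2023,  6,         200,    40,           1,
  2023,  6,         200,    20,           1
)

# Collapse to one row per state-year, weighting each person by PERWT
state_year_toy <- acs_toy %>%
  group_by(YEAR, STATEFIP) %>%
  summarize(
    avg_hourly_wage = weighted.mean(hourly_wage, PERWT, na.rm = TRUE),
    pct_employed    = weighted.mean(employed, PERWT, na.rm = TRUE),  # share between 0 and 1
    sample_size     = n(),
    .groups = "drop"
  )

state_year_toy
```

Taking the weighted mean of a 0/1 indicator gives the weighted share of people in that category, which is how a column like `pct_employed` can be built from individual records.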
To capture regional differences, U.S. states were classified into four regions (Northeast, Midwest, South, and West) using FIPS state codes. This allows for analysis of geographic differences in wages and income after controlling for demographic and labor market characteristics.
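One way to build such a classification is sketched below. It assumes the standard U.S. Census Bureau four-region grouping; the helper name `fips_to_region` and the vector names are hypothetical and may differ from the code actually used.

```r
# Census Bureau regions keyed by FIPS state code (50 states plus DC)
northeast <- c(9, 23, 25, 33, 34, 36, 42, 44, 50)
midwest   <- c(17, 18, 19, 20, 26, 27, 29, 31, 38, 39, 46, 55)
south     <- c(1, 5, 10, 11, 12, 13, 21, 22, 24, 28, 37, 40, 45, 47, 48, 51, 54)
west      <- c(2, 4, 6, 8, 15, 16, 30, 32, 35, 41, 49, 53, 56)

# Hypothetical helper: map a vector of FIPS codes to region labels.
# Codes outside the four lists (e.g. territories) are returned as NA.
fips_to_region <- function(fips) {
  region <- rep(NA_character_, length(fips))
  region[fips %in% northeast] <- "Northeast"
  region[fips %in% midwest]   <- "Midwest"
  region[fips %in% south]     <- "South"
  region[fips %in% west]      <- "West"
  region
}

# Ohio, California, New York, Texas
fips_to_region(c(39, 6, 36, 48))
```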
***Here are variable descriptions:***

- `YEAR`: calendar year (2000–2023)
- `STATEFIP`: numeric state identifier
- `avg_hourly_wage`: mean of individual hourly wages
- `median_wage`: median hourly wage
- `avg_household_income`: mean of household income
- `avg_age`: mean age
- `pct_female`: percent of respondents who are female
- `pct_citizen`: percent who are U.S. citizens
- `pct_white`, `pct_black`: percent of the population identified as white or Black
- `pct_employed`, `pct_unemployed`, `pct_not_in_labor_force`: percent in each labor-force status category
- `sample_size`: number of observations used in that state-year
- `region`: region of the state (Northeast, Midwest, South, or West)

Map EDA
===
Column {.tabset data-width=1000}
---
### Map of State Wage 2023 Averages
```{r}
valid_states <- setdiff(1:56, c(3, 7, 14, 43, 52, 72))
state_lookup <- tibble::tribble(
~STATEFIP, ~state_name,
1, "alabama",
2, "alaska",
4, "arizona",
5, "arkansas",
6, "california",
8, "colorado",
9, "connecticut",
10, "delaware",
11, "district of columbia",
12, "florida",
13, "georgia",
15, "hawaii",
16, "idaho",
17, "illinois",
18, "indiana",
19, "iowa",
20, "kansas",
21, "kentucky",
22, "louisiana",
23, "maine",
24, "maryland",
25, "massachusetts",
26, "michigan",
27, "minnesota",
28, "mississippi",
29, "missouri",
30, "montana",
31, "nebraska",
32, "nevada",
33, "new hampshire",
34, "new jersey",
35, "new mexico",
36, "new york",
37, "north carolina",
38, "north dakota",
39, "ohio",
40, "oklahoma",
41, "oregon",
42, "pennsylvania",
44, "rhode island",
45, "south carolina",
46, "south dakota",
47, "tennessee",
48, "texas",
49, "utah",
50, "vermont",
51, "virginia",
53, "washington",
54, "west virginia",
55, "wisconsin",
56, "wyoming"
)
# Merge lookup into the original dataset
state_summary <- state_summary %>%
left_join(state_lookup, by = "STATEFIP")
# Filter out territories before mapping
state_summary <- state_summary %>%
filter(STATEFIP %in% state_lookup$STATEFIP)
us_states <- map_data("state") %>%
rename(state = region)
map_df <- us_states %>%
left_join(
state_summary %>% filter(YEAR == 2023),
by = c("state" = "state_name")
)
# round values to 2 decimal places
map_df$avg_hourly_wage <- round(map_df$avg_hourly_wage, 2)
map_df$state <- str_to_title(map_df$state)
p1 <- ggplot(map_df, aes(long, lat)) +
  geom_polygon(aes(group = group,
                   fill = avg_hourly_wage,
                   text = paste0(state, ":\n",
                                 avg_hourly_wage, " dollars")),
               colour = "white") +
  coord_fixed(1.3) +
  scale_fill_viridis_c(option = "plasma") +
  labs(title = "Average Hourly Wage by State, 2023") +
  theme_void()
ggplotly(p1, tooltip = "text")
```
### Map of State Hourly Wages from 2000-2023
```{r}
# Long-run (2000–2023) average hourly wage by state
state_longrun <- state_summary %>%
filter(YEAR >= 2000, YEAR <= 2023) %>%
group_by(STATEFIP, state_name) %>%
summarize(
avg_hourly_wage = mean(avg_hourly_wage, na.rm = TRUE),
.groups = "drop"
)
us_states <- map_data("state") %>%
rename(state = region)
map_df <- us_states %>%
left_join(
state_longrun,
by = c("state" = "state_name")
)
# round values to 2 decimal places
map_df$avg_hourly_wage <- round(map_df$avg_hourly_wage, 2)
map_df$state <- str_to_title(map_df$state)
p_long <- ggplot(map_df, aes(long, lat)) +
  geom_polygon(aes(group = group,
                   fill = avg_hourly_wage,
                   text = paste0(state, ":\n",
                                 avg_hourly_wage, " dollars")),
               colour = "white") +
  coord_fixed(1.3) +
  scale_fill_viridis_c(option = "plasma") +
  labs(title = "Average Hourly Wage by State, 2000–2023") +
  theme_void()
ggplotly(p_long, tooltip = "text")
```
### Map of Average Household Income in 2023
```{r}
# Make sure map data is ready
us_states <- map_data("state") %>%
rename(state = region)
# Merge 2023 state-level household income into map data
map_income_2023 <- us_states %>%
left_join(
state_summary %>% filter(YEAR == 2023),
by = c("state" = "state_name")
)
# Clean labels and round values
map_income_2023$avg_household_income <- round(map_income_2023$avg_household_income, 0)
map_income_2023$state <- stringr::str_to_title(map_income_2023$state)
# Plot: Average Household Income by State, 2023
p_income_2023 <- ggplot(map_income_2023, aes(long, lat)) +
geom_polygon(aes(group = group,
fill = avg_household_income,
text = paste0(state, ":\n$",
# trim = TRUE stops format() from padding shorter values with leading spaces
format(avg_household_income, big.mark = ",", trim = TRUE))),
colour = "white") +
coord_fixed(1.3) +
scale_fill_viridis_c(option = "plasma") +
labs(title = "Average Household Income by State, 2023",
fill = "Income ($)") +
theme_void()
ggplotly(p_income_2023, tooltip = "text")
```
### Map of Average Household Income from 2000-2023
```{r}
# Long-run (2000–2023) average household income by state
state_longrun_income <- state_summary %>%
filter(YEAR >= 2000, YEAR <= 2023) %>%
group_by(STATEFIP, state_name) %>%
summarize(
avg_household_income = weighted.mean(avg_household_income, sample_size, na.rm = TRUE),
.groups = "drop"
)
# Join long-run income to map data
map_income_longrun <- us_states %>%
left_join(
state_longrun_income,
by = c("state" = "state_name")
)
# Clean labels and round values
map_income_longrun$avg_household_income <- round(map_income_longrun$avg_household_income, 0)
map_income_longrun$state <- stringr::str_to_title(map_income_longrun$state)
# Plot: Long-Run Average Household Income, 2000–2023
p_income_longrun <- ggplot(map_income_longrun, aes(long, lat)) +
geom_polygon(aes(group = group,
fill = avg_household_income,
text = paste0(state, ":\n$",
# trim = TRUE stops format() from padding shorter values with leading spaces
format(avg_household_income, big.mark = ",", trim = TRUE))),
colour = "white") +
coord_fixed(1.3) +
scale_fill_viridis_c(option = "plasma") +
labs(title = "Long-Run Average Household Income by State, 2000–2023",
fill = "Income ($)") +
theme_void()
ggplotly(p_income_longrun, tooltip = "text")
```
Regression Analysis
===
Column {.tabset data-width=600}
---
### Model 1 (Explaining Hourly Wages)
```{r}
# For Model 1, looking into what determines average hourly wage
model1 <- lm(avg_hourly_wage ~ pct_employed + pct_unemployed + pct_female + pct_white + pct_black + avg_age + pct_citizen + YEAR + region, data = state_summary)
summary(model1)
```
### Model 2 (Explaining Household Income)
```{r}
# Model 2, looking into what determines average household income
model2 <- lm(log(avg_household_income) ~ pct_employed + pct_unemployed +
pct_female + pct_white + pct_black + avg_age + pct_citizen + YEAR + region, data = state_summary)
summary(model2)
```
Column {.tabset data-width=400}
---
### Methods Used
To examine the relationship between economic outcomes and demographics, employment status, and region, this study uses multiple linear regression models. The two dependent variables are average hourly wages and average household income by state-year. Model 1 uses average hourly wages as the dependent variable, while Model 2 uses the natural log of average household income. The log transformation follows from the right skew found in the EDA and makes interpretation easier, since coefficients can be read as approximate percent changes.
Both models include labor market measures (employment and unemployment rates) as well as demographic controls for each state's age, gender, race, and citizenship composition. A time trend by year is also added to control for changes in wages and income over time, such as the increases due to inflation shown in the EDA. Regional differences are accounted for with indicators for the Midwest, South, and West, with the Northeast as the reference category in both models.
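Making the Northeast the reference category requires setting the factor's baseline level explicitly, since R orders factor levels alphabetically by default. The sketch below shows one way to do this on simulated data; the data frame `toy` and its values are hypothetical, and the study's own releveling code may differ.

```r
set.seed(1)

# Simulated state-year data (hypothetical values, for illustration only)
n <- 200
toy <- data.frame(
  region       = sample(c("Northeast", "Midwest", "South", "West"), n, replace = TRUE),
  pct_employed = runif(n, 0.40, 0.55),
  YEAR         = sample(2000:2023, n, replace = TRUE)
)
toy$avg_hourly_wage <- 10 + 10 * toy$pct_employed + 0.3 * (toy$YEAR - 2000) +
  ifelse(toy$region == "Northeast", 1.5, 0) + rnorm(n)

# factor() sorts levels alphabetically, so "Midwest" would be the baseline;
# relevel() makes "Northeast" the omitted category instead
toy$region <- relevel(factor(toy$region), ref = "Northeast")

fit <- lm(avg_hourly_wage ~ pct_employed + YEAR + region, data = toy)

# Region coefficients are now differences relative to the Northeast
names(coef(fit))
```

With this setup, each region coefficient reads as the gap between that region and the Northeast, holding the other predictors constant.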
Data cleaning and variable construction were performed mainly using the tidyverse package. The regression models and diagnostics were created using base R functions.
### Model 1 (Explaining Hourly Wages) Analysis
Model 1 examines how demographic, employment, and regional characteristics explain average hourly wages across U.S. states. The employment rate has a positive effect: its coefficient of 10.71 indicates that a 1 percentage point increase in a state’s employment rate is associated with an increase of about $0.11 in the average hourly wage, holding all other variables constant. Unemployment also carries a positive coefficient of 18.04. Because pct_not_in_labor_force is left out of the model, both coefficients are best read relative to the share of the population not in the labor force. One possible interpretation is that states experiencing higher unemployment raise wages to attract or retain workers, an effect that would align with short-run labor-market theory.
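The per-percentage-point figures in this section come from scaling each coefficient by 0.01. The sketch below reproduces that arithmetic from the coefficients quoted in the Model 1 discussion; it assumes the share variables are stored on a 0-1 scale, which is what the reported magnitudes imply.

```r
# Coefficients quoted in the Model 1 discussion
coefs <- c(pct_employed = 10.71, pct_unemployed = 18.04,
           pct_female = -36.64, pct_black = 4.32)

# A 1 percentage point change equals 0.01 on a 0-1 share scale (assumed scale)
per_point <- coefs * 0.01

# Dollar change in average hourly wage per 1 percentage point
round(per_point, 2)
```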
Regional differences are also significant. Relative to the Northeast (the omitted category), hourly wages are about $1.21 lower in the Midwest and about $1.46 lower in the South, even after controlling for demographics, labor-force characteristics, and year. The West shows the smallest gap, about $0.28 below the Northeast, implying that Western states are closer to Northeastern wage levels once the controls are accounted for. These findings indicate that there are meaningful regional wage disparities independent of workforce composition and demographic makeup. Overall, the Northeast and West have higher average hourly wages than the Midwest and South.
Several demographic characteristics also contribute to differences in state-level wages. The female share of a state's population has a significant negative coefficient of –36.64, implying that a 1 percentage point increase in the female population share is associated with roughly a $0.37 decrease in average wages, consistent with gender wage disparities across states. The coefficient on the Black population share is positive (4.32), meaning a 1 percentage point increase corresponds to about a $0.04 increase in wages, while the white population share is not statistically significant. Average age has a positive effect: each additional year of average age predicts a $0.15 increase in the average hourly wage.
Overall, with an R-squared of 0.825, Model 1 explains a large share of the variation in hourly wages across U.S. states from 2000 to 2023. Employment conditions and region emerge as the most influential drivers of wage differences, supplemented by other meaningful demographic effects.
### Model 2 (Explaining Household Income) Analysis
Model 2 uses the natural log of household income as the dependent variable, so the coefficients can be interpreted as approximate percentage changes. This model fits extremely well, with an R-squared of 0.895, indicating that employment, demographics, regional factors, and time trends collectively explain 89.5% of the variation in log household income.
The coefficient on the employment rate of 2.068 implies that a 1 percentage point increase in the employment rate is associated with about a 2.1% increase in average household income, holding other factors constant. Unemployment also predicts higher income, with a coefficient of 0.956, suggesting that income tends to be higher in periods or states where unemployment is elevated, which is again consistent with the results from Model 1.
Regional effects are highly significant in this model as well. Compared to the Northeast, household incomes are 11.9% lower in the Midwest, 14.1% lower in the South, and 9.9% lower in the West, even after controlling for demographics, employment conditions, and time. This shows that regional income gaps persist and are not simply a function of workforce makeup or labor-market strength.
Demographic characteristics also shape household income. The white population share has a significant negative coefficient of –0.181, meaning a 1 percentage point increase in the white population share predicts a 0.18% decrease in income. This is consistent with the EDA and Model 1, where some of the highest-income states are more racially diverse. The Black population share shows the opposite pattern: each percentage point increase is associated with a 0.19% increase in income. The female population share is not statistically significant, but average age again has a positive coefficient, implying that states with older populations tend to have higher incomes.
The year variable is also a strong predictor: its coefficient of 0.032 implies that income increases by roughly 3.2% per year, capturing long-term growth from inflation, productivity, and the rising cost of living.
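The percentage readings in this section rely on the approximation that, with a logged outcome, a coefficient times 100 is roughly the percent change. The sketch below compares that approximation with the exact conversion for two coefficients quoted above; it assumes the share variables are stored on a 0-1 scale.

```r
# Coefficients quoted for Model 2 (outcome: log of household income)
b_employed <- 2.068   # per unit change in a 0-1 share (assumed scale)
b_year     <- 0.032   # per calendar year

# Approximate percent change: 100 * beta * delta
approx_employed <- 100 * b_employed * 0.01   # 1 percentage point = 0.01
approx_year     <- 100 * b_year * 1

# Exact percent change implied by a log outcome: 100 * (exp(beta * delta) - 1)
exact_employed <- 100 * (exp(b_employed * 0.01) - 1)
exact_year     <- 100 * (exp(b_year * 1) - 1)

round(c(approx_employed = approx_employed, exact_employed = exact_employed,
        approx_year = approx_year, exact_year = exact_year), 2)
```

For coefficients this small the two versions agree closely; the gap widens for larger effects, where the exact form is the safer one to report.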
Model 2 shows that household income is shaped by labor markets, sustained income growth over time, and meaningful regional differences. Demographic factors play a smaller but still important role in explaining state-level variation in household income.
Diagnostics
===
Column {.tabset data-width=600}
---
### Diagnostic Plots for Model 1
```{r}
par(mfrow = c(2, 2))
plot(model1)
```
### Diagnostic Plots for Model 2
```{r}
par(mfrow = c(2, 2))
plot(model2)
```
### Multicollinearity Check
***Model 1***
```{r}
vif(model1)
```
***Model 2***
```{r}
vif(model2)
```
Column {data-width=400}
---
### Analysis
***Model 1***
The diagnostic plots for Model 1 show that the regression fits reasonably well, though some assumptions are only partially met. In the Residuals vs. Fitted plot, the line stays close to horizontal, supporting a linear relationship between the predictors and hourly wages. The line in the Scale–Location plot is also mostly flat, satisfying the condition of constant variance. The Q–Q plot shows that most residuals follow a normal distribution, with a slight deviation in the upper tail that raises a minor question about the condition of normality. Finally, the Residuals vs. Leverage plot reveals no influential outliers, as no points approach high Cook’s distance values. Overall, Model 1 performs adequately, but the diagnostics suggest a very small departure from normality that should be noted when interpreting results.
***Model 2***
The diagnostic plots for Model 2 indicate that the log transformation of household income substantially improved the model. The Residuals vs. Fitted plot shows no meaningful curvature, suggesting that the condition of linearity is met. The Scale–Location plot demonstrates constant variance across fitted values, while the Q–Q plot shows residuals closely following the normal line with no deviation at the tails, meeting the condition of normality. The Residuals vs. Leverage plot also indicates that there are no outliers or influential points beyond the Cook's distance contours. Together, these diagnostics show that Model 2 meets the regression assumptions more fully than Model 1 and is well suited to modeling average household income by state.
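The visual checks above can be paired with numeric ones. The sketch below uses simulated data and a hypothetical model `fit`, not the study's models, to show how Cook's distance and a normality test can be pulled from an `lm` object with base R.

```r
set.seed(7)

# Simulated data and model (hypothetical, for illustration only)
toy <- data.frame(x = runif(100))
toy$y <- 2 + 3 * toy$x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x, data = toy)

# Cook's distance for every observation; plot.lm() draws its
# reference contours at 0.5 and 1
cd <- cooks.distance(fit)
max_cd <- max(cd)

# Observations above the common 4/n screening rule of thumb
flagged <- which(cd > 4 / nrow(toy))

# Shapiro-Wilk test as a rough numeric companion to the Q-Q plot
sw_p <- shapiro.test(residuals(fit))$p.value

list(max_cooks = max_cd, n_flagged = length(flagged), shapiro_p = sw_p)
```

The same three calls can be run on any fitted `lm` object; the 4/n cutoff is only a screening convention, so flagged points deserve a look rather than automatic removal.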
***Multicollinearity***
To evaluate multicollinearity among the predictors in my models, I used generalized VIF (GVIF) values because the models include a categorical predictor (region). Converting the GVIF to its adjusted scale makes it comparable across predictors with different degrees of freedom.
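The conversion to the adjusted scale can be sketched as follows, assuming `vif()` here is `car::vif()`, which reports GVIF, Df, and GVIF^(1/(2*Df)) when a model contains a factor. The GVIF numbers below are hypothetical and are not the study's output.

```r
# Hypothetical GVIF values and degrees of freedom (not the study's output)
gvif <- c(pct_employed = 2.25, region = 6.61)
df   <- c(pct_employed = 1,    region = 3)

# Adjusted scale: GVIF^(1 / (2 * Df))
adj_gvif <- gvif^(1 / (2 * df))

# Squaring the adjusted value puts it back on the familiar VIF scale
vif_scale <- adj_gvif^2

round(cbind(gvif, df, adj_gvif, vif_scale), 2)
```

For a one-degree-of-freedom term the adjusted value is the square root of the ordinary VIF, so it may be worth squaring adjusted values before comparing them with a conventional cutoff such as 5.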
The VIF results for Model 1 show no evidence of harmful multicollinearity, as all VIF values fall well below 5. The employment variables central to the model, pct_employed (VIF of 1.5) and pct_unemployed (VIF of 1.21), display very low multicollinearity, meaning their effects can be interpreted reliably. The demographic variables pct_black (2.07), pct_white (1.54), pct_female (1.61), and avg_age (1.65) also display low levels of multicollinearity, which is expected given that demographic characteristics vary considerably across states. The region factor also shows a low adjusted GVIF of 1.37, indicating it is not strongly correlated with demographics or employment status. Overall, Model 1’s predictors are sufficiently independent, and multicollinearity does not threaten the validity of this model.
Model 2 shows a nearly identical multicollinearity pattern, again indicating no problematic overlap among predictors. The employment variables show low VIF values: pct_employed (1.48) and pct_unemployed (1.23). The demographic predictors also exhibit low adjusted GVIFs: pct_black (2.06), pct_female (1.61), pct_white (1.54), and avg_age (1.70) all remain below the threshold of 5. The region factor again shows a low GVIF (1.37), indicating that region is not strongly correlated with the other variables. Because all predictors remain near or below a GVIF of 2, Model 2’s coefficients can be interpreted with confidence, and multicollinearity is not a concern.
Results
===
Column {data-width=650}
---
### Findings
This study finds clear differences in economic outcomes across U.S. states that are tied to labor market conditions, demographic composition, and regional location. One of the main findings is the strong association between labor market conditions (employment and unemployment rates) and economic performance (wages and income). States with higher proportions of employed individuals tend to exhibit higher average wages and household incomes, underscoring the importance of labor force participation for state economic performance.
Geographic location also emerges as a key factor shaping economic outcomes. States in the Northeast and West generally display higher wages and household incomes compared to those in the Midwest and South. These differences persist over time and likely reflect structural factors such as industry concentration, urbanization, and regional economic development. The results suggest that regional economic advantages continue to play an important role in shaping opportunities and living standards across the country.
Demographic patterns provide additional context for these economic differences. States with older average populations tend to have higher incomes, consistent with workers gaining experience and progressing through their careers over time. Variation in gender composition, racial composition, and citizenship rates is also associated with differences in economic outcomes, reflecting broader patterns of migration, workforce composition, and labor market sorting.
Overall, the findings emphasize that strong labor markets and regional factors are central drivers of wage and income disparities across states. These insights have important implications for policymakers seeking to promote economic growth, reduce inequality, and improve labor market outcomes at the state level.
Column {data-width=350}
---
### Limitations
This study has several limitations. Because the analysis uses state-level averages, with one observation per state-year, it cannot capture differences within states, such as variation between urban and rural areas. The results describe associations rather than causal effects, because outside factors like state policies, cost of living, and local industry mix may also influence wages and income. Including these variables in future studies would help isolate the effects more fully. The data also span only 2000 to 2023, which leaves out earlier historical trends that could help explain these relationships. Finally, the regional groupings are broad and may mask important differences among states within the same region. Using finer regional groupings would give better results in future studies.
### Resources
Team, MPC UX/UI. “U.S. Census Data for Social, Economic, and Health Research.” IPUMS USA, usa.ipums.org/usa/. Accessed 10 Dec. 2025.
U.S. Census Bureau. (2023). Census regions and divisions of the United States. https://www.census.gov/
Author
===
### About Myself
My name is Mark Burns, and I am a senior at the University of Dayton from Cleveland, Ohio, majoring in Economics with minors in Data Analytics and Finance. I will graduate in May 2026.
This past summer I worked as a Business Analytics intern, creating budget reports in Excel and using Power BI for further insight.
Please feel free to connect with me on LinkedIn:
[Visit my LinkedIn profile](https://www.linkedin.com/in/markdburns2)