Note: This notebook was updated on Tuesday, 16 September at 3:00 PM to correct a data preprocessing error that erroneously filtered some data out of the summaries. The correction is viewable here.
This notebook contains the data analysis for the charts and numbers used on our site, CBFC.WATCH. To ensure full transparency with our readers, we are publishing our methodology and the R code used to generate the statistics for various parts of the site, as well as to provide a starting point for others who may be curious about how to use the data.
We explore several questions about how films are censored in India. First, we import the dataset, correct data types, and standardize categorical information like language and regional office names.
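The code throughout assumes a setup chunk (not shown in this section) that loads the required packages and defines our plotting helpers. A minimal sketch of what that setup likely contains; the exact chunk is an assumption:
library(tidyverse)  # read_csv, dplyr, tidyr, stringr, purrr, forcats, ggplot2
library(jsonlite)   # write_json()
library(lubridate)  # floor_date()
library(scales)     # comma, percent(), percent_format()
library(broom)      # tidy()
library(gt)         # formatted summary tables
library(tidytext)   # bind_tf_idf()
library(widyr)      # pairwise_cor()
library(igraph)     # graph_from_data_frame()
library(ggraph)     # network graph plotting
# zoo is called via zoo::rollmean() and does not need attaching.
# theme_cbfc(), wes_colors, tertiary_color, secondary_color, and
# format_seconds() are custom helpers defined in our setup code.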
data <- read_csv("https://github.com/diagram-chasing/censor-board-cuts/raw/refs/heads/master/data/data.csv",
col_types = cols(.default = "c")) %>%
mutate(
cert_date = as.Date(cert_date),
total_modified_time_secs = as.numeric(total_modified_time_secs),
deleted_secs = as.numeric(deleted_secs),
replaced_secs = as.numeric(replaced_secs),
inserted_secs = as.numeric(inserted_secs)
) %>%
# The 'certifier' column contains a string like "Examining Committee, Mumbai".
# We extract the regional office name by splitting the string by the comma
# and taking the last element.
mutate(
office = str_split(certifier, ",") %>%
map_chr(last) %>%
str_trim()
) %>%
separate_rows(ai_content_types, sep = "\\|") %>%
mutate(ai_content_types = str_trim(ai_content_types)) %>%
# Filter out rows where the content type is empty after separation
# (currently left disabled; see the correction note at the top):
# filter(ai_content_types != "") %>%
# Standardize language names to correct for typos and variations in the raw data
mutate(
language = case_when(
language == "Oriya" ~ "Odia",
language == "Gujrati" ~ "Gujarati",
language == "Chhatisgarhi" ~ "Chhattisgarhi",
language == "Hariyanvi" ~ "Haryanvi",
language == "Hindi Dub" ~ "Hindi Dubbed",
TRUE ~ language
)
) %>%
filter(!is.na(language))

Now we count how many films appear in each language and show the top ten.
films_by_language_data <- data %>%
distinct(id, language) %>%
count(language, sort = TRUE) %>%
top_n(10, n) %>%
mutate(language = fct_reorder(language, n))
write_json(films_by_language_data, "films_by_language.json", pretty = TRUE, auto_unbox = TRUE)
films_by_language_data %>%
ggplot(aes(x = n, y = language)) +
geom_col(fill = tertiary_color, alpha = 0.9) +
geom_text(aes(label = comma(n)), hjust = -0.15, size = 3.5, color = "gray20") +
scale_x_continuous(
labels = comma,
expand = expansion(mult = c(0, 0.12))
) +
labs(
title = "Top 15 Languages by Number of Films Censored",
subtitle = "Hindi, Telugu, and Tamil are the languages with the most films",
x = "Number of Unique Films",
y = NULL
) +
theme_cbfc()

Next, we look at the most common reasons for modifications, based on the AI-classified content types.
mods_by_content_data <- data %>%
filter(!is.na(ai_content_types)) %>%
count(ai_content_types, sort = TRUE) %>%
top_n(15, n) %>%
mutate(
pretty_name = str_replace_all(ai_content_types, "_", " ") %>% str_to_title(),
pretty_name = fct_reorder(pretty_name, n)
)
write_json(mods_by_content_data, "modifications_by_content.json", pretty = TRUE, auto_unbox = TRUE)
mods_by_content_data %>%
ggplot(aes(x = n, y = pretty_name)) +
geom_col(fill = secondary_color, alpha = 0.9) +
geom_text(aes(label = comma(n)), hjust = -0.15, size = 3, color = "gray20") +
scale_x_continuous(
labels = comma,
expand = expansion(mult = c(0, 0.1))
) +
labs(
title = "Most Common Reasons for Film Modifications",
x = "Number of Modifications",
y = "Content Category"
) +
theme_cbfc()

We’ve classified each modification log into verbs depending on what action was taken. For example, adding a smoking disclaimer might be an ‘Insertion’, while removing an entire scene is a ‘Deletion’. There are also replacements, audio modifications, visual modifications (such as blurs), and so on. We can look at the general distribution of durations for each of these edit types.
duration_summary <- data %>%
filter(total_modified_time_secs > 0, !is.na(ai_action)) %>%
group_by(ai_action) %>%
summarise(
min = min(total_modified_time_secs, na.rm = TRUE),
q1 = quantile(total_modified_time_secs, 0.25, na.rm = TRUE),
median = median(total_modified_time_secs, na.rm = TRUE),
q3 = quantile(total_modified_time_secs, 0.75, na.rm = TRUE),
max = max(total_modified_time_secs, na.rm = TRUE),
count = n(),
.groups = 'drop'
) %>%
mutate(pretty_name = str_replace_all(ai_action, "_", " ") %>% str_to_title()) %>%
arrange(median)
# Export sample of raw data for BoxX component
duration_boxplot_data <- data %>%
filter(
total_modified_time_secs > 0,
total_modified_time_secs < 1400, # Found that anything above this is mostly incorrectly entered data
!is.na(ai_action),
ai_action != 'content_overlay',
movie_name != 'ARISHADVARGA' # this movie clearly has a data processing error where the times are added cumulatively https://archive.org/details/cbfc-ecinepramaan-100020292100000261
) %>%
mutate(pretty_name = str_replace_all(ai_action, "_", " ") %>% str_to_title()) %>%
group_by(ai_action, pretty_name) %>%
# Sample up to 500 points. If a group has < 500 points, it will take all of them.
slice_sample(n = 500) %>%
ungroup() %>%
select(
action = ai_action,
category = pretty_name,
duration = total_modified_time_secs
)
write_json(duration_summary, "duration_by_action_summary.json", pretty = TRUE, auto_unbox = TRUE)
write_json(duration_boxplot_data, "duration_by_action_boxplot.json", pretty = TRUE, auto_unbox = TRUE)
duration_boxplot_data %>%
ggplot(aes(
x = fct_reorder(category, duration, .fun = median),
y = duration,
fill = category
)) +
geom_boxplot(show.legend = FALSE, alpha = 0.8, fill = wes_colors[3]) +
scale_y_log10(
breaks = c(1, 10, 60, 300, 1800),
labels = c("1 sec", "10 secs", "1 min", "5 mins", "30 mins")
) +
coord_flip() +
labs(
title = "How Long are Different Types of Edits?",
subtitle = "Distribution of modification durations for each action type on a logarithmic scale.",
x = NULL,
y = "Duration of Modification"
) +
theme_cbfc()

The dataset includes tags for the reason each modification was made: violence, profanity, and so on. While I have a pretty good idea of what this will show, we can look at the breakdown of each reason by the film’s rating (the U, UA, or A classification that determines who can watch the movie).
rating_breakdown <- data %>%
filter(rating %in% c("U", "UA", "A"), !is.na(ai_content_types)) %>%
mutate(content_category = case_when(
ai_content_types %in% c("sexual_explicit", "sexual_suggestive") ~ "Sexual Content",
ai_content_types == "profanity" ~ "Profanity",
ai_content_types == "violence" ~ "Violence",
ai_content_types == "substance" ~ "Substance Use",
ai_content_types == "religious" ~ "Religious Content",
ai_content_types == "political" ~ "Political Content",
TRUE ~ "Other"
)) %>%
count(rating, content_category, name = "modification_count", sort = TRUE) %>%
group_by(rating) %>%
mutate(percentage = modification_count / sum(modification_count)) %>%
ungroup()
common_types <- rating_breakdown %>%
filter(content_category != "Other") %>%
group_by(content_category) %>%
summarise(total = sum(modification_count), .groups = 'drop') %>%
pull(content_category)
rating_breakdown %>%
filter(content_category %in% common_types) %>%
ggplot(aes(
x = percentage,
y = fct_reorder(content_category, percentage, .desc = FALSE),
fill = rating
)) +
geom_col(color = "white", linewidth = 0.5) +
facet_wrap(~rating, scales = "free_y", ncol = 3) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_fill_manual(
values = c("U" = wes_colors[1], "UA" = wes_colors[2], "A" = wes_colors[3])
) +
labs(
title = "Censorship reasons by film rating",
subtitle = "Proportional breakdown of modification reasons within each film rating category.",
x = "% of all modifications for this rating",
y = NULL
) +
theme_cbfc() +
theme(
strip.text = element_text(size = 14, face = "bold", color = "gray20", margin = margin(t = 5, b = 10)),
strip.background = element_blank(),
panel.grid.major.x = element_line(color = "gray90", linetype = "dotted"),
panel.spacing.x = unit(2, "lines"),
legend.position = "none"
)

Different regional offices of the CBFC may apply censorship standards differently. Here, we calculate the average total time modified per film for each major office. We filter out extreme outliers (films with more than 30 minutes of cuts) and only include offices that have certified a sufficient number of films (at least 50).
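The summary table below relies on a format_seconds() helper from our setup code. A rough sketch of its behavior, assuming it renders a duration like 185 seconds as "3m 05s" (an assumption, not the original implementation):
# Hypothetical sketch: format a duration in seconds as "Xm YYs".
format_seconds <- function(secs) {
  mins <- floor(secs / 60)
  rem <- round(secs - mins * 60)
  sprintf("%dm %02ds", mins, rem)
}
format_seconds(185)  # "3m 05s"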
film_summary <- data %>%
filter(!is.na(office) & !str_detect(office, "\\.mp4$")) %>%
group_by(id, movie_name, office) %>%
summarise(
total_secs = sum(total_modified_time_secs, na.rm = TRUE),
.groups = 'drop'
) %>%
filter(total_secs > 0, total_secs < 1800)
office_stats <- film_summary %>%
group_by(office) %>%
filter(n() >= 50) %>%
summarise(
film_count = n(),
mean_total_secs = mean(total_secs, na.rm = TRUE),
se = sd(total_secs, na.rm = TRUE) / sqrt(n()),
ci_lower_secs = pmax(0, mean_total_secs - 1.96 * se),
ci_upper_secs = mean_total_secs + 1.96 * se,
.groups = 'drop'
) %>%
arrange(desc(mean_total_secs))
write_json(office_stats, "cuts_by_office.json", pretty = TRUE, auto_unbox = TRUE)
summary_table <- office_stats %>%
mutate(
`Average Time` = format_seconds(mean_total_secs),
`95% CI Lower` = format_seconds(ci_lower_secs),
`95% CI Upper` = format_seconds(ci_upper_secs)
) %>%
select(
Office = office,
`Films Analyzed` = film_count,
`Average Time`,
`95% CI Lower`,
`95% CI Upper`
)
summary_table %>%
gt() %>%
tab_header(
title = "Average Modification Time per Film by CBFC Regional Office",
subtitle = "Analysis of offices with at least 50 certified films"
) %>%
tab_style(
style = list(
cell_fill(color = wes_colors[1], alpha = 0.3),
cell_text(weight = "bold")
),
locations = cells_column_labels()
) %>%
tab_style(
style = cell_text(align = "center"),
locations = cells_body(columns = c(`Films Analyzed`, `Average Time`, `95% CI Lower`, `95% CI Upper`))
) %>%
cols_align(
align = "left",
columns = Office
) %>%
tab_options(
table.font.names = "Atkinson Hyperlegible",
heading.title.font.size = 16,
heading.subtitle.font.size = 12,
column_labels.font.size = 12,
table.font.size = 11
)

Average Modification Time per Film by CBFC Regional Office
Analysis of offices with at least 50 certified films
| Office | Films Analyzed | Average Time | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Cuttack | 118 | 9m 10s | 7m 38s | 10m 43s |
| Delhi | 442 | 4m 06s | 3m 33s | 4m 38s |
| Mumbai | 6573 | 3m 51s | 3m 42s | 3m 59s |
| Thiruvananthpuram | 688 | 3m 16s | 2m 55s | 3m 36s |
| Chennai | 1657 | 3m 05s | 2m 52s | 3m 19s |
| Kolkata | 209 | 2m 32s | 1m 59s | 3m 04s |
| Bangalore | 982 | 2m 13s | 1m 59s | 2m 27s |
| Hyderabad | 1055 | 2m 01s | 1m 49s | 2m 13s |
Have the reasons for film censorship changed over time? This analysis tracks the changes in different modification types (Violence, Profanity, Sexual Content, etc.) on a monthly basis.
For each category, we calculate its deviation from its own historical average (the “baseline”). This method clearly shows periods when a certain type of content was censored more or less frequently than usual. We also apply a 3-month moving average to smooth out short-term noise and reveal the underlying trend.
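As a quick numeric illustration of the deviation and smoothing calculations (made-up rates, not real data):
# Three months of made-up censorship rates for one category:
rate <- c(0.20, 0.25, 0.15)
baseline <- mean(rate)                               # 0.20
deviation_pct <- (rate - baseline) / baseline * 100  # 0%, +25%, -25%
# 3-month centered moving average; the endpoints become NA:
zoo::rollmean(deviation_pct, k = 3, fill = NA, align = "center")  # NA, 0, NA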
# Helper: where the smoothed deviation series crosses zero between two months,
# insert an interpolated point exactly at zero so the line color can switch
# cleanly at the baseline.
interpolate_at_zero <- function(df) {
sign_change <- which(diff(sign(df$deviation_pct_smooth)) != 0)
if (length(sign_change) == 0) return(df)
interpolated <- map_df(sign_change, ~{
slice <- df[.x:(.x + 1), ]
zero_cross_date <- approx(slice$deviation_pct_smooth, as.numeric(slice$year_month), xout = 0)$y
if (is.na(zero_cross_date)) return(tibble())
tibble(
year_month = as.Date(zero_cross_date, origin = "1970-01-01"),
deviation_pct_smooth = 0,
above_baseline = slice$above_baseline[2],
content_type = slice$content_type[1],
baseline_rate = slice$baseline_rate[1]
)
})
bind_rows(df, interpolated) %>% arrange(year_month)
}
chart_data <- data %>%
mutate(
year_month = floor_date(cert_date, "month"),
content_type = case_when(
ai_content_types %in% c("sexual_explicit", "sexual_suggestive") ~ "Sexual Content",
ai_content_types == "profanity" ~ "Profanity",
ai_content_types == "violence" ~ "Violence",
ai_content_types == "substance" ~ "Substance Use",
ai_content_types == "religious" ~ "Religious Content",
ai_content_types == "political" ~ "Political Content",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(content_type)) %>%
group_by(year_month, content_type) %>%
summarise(films_with_mods = n_distinct(certificate_id), .groups = "drop") %>%
left_join(
data %>%
mutate(year_month = floor_date(cert_date, "month")) %>%
group_by(year_month) %>%
summarise(total_films = n_distinct(certificate_id)),
by = "year_month"
) %>%
mutate(rate = films_with_mods / total_films) %>%
filter(total_films >= 10) %>%
group_by(content_type) %>%
arrange(year_month) %>%
mutate(
baseline_rate = mean(rate),
deviation_pct = (rate - baseline_rate) / baseline_rate * 100,
deviation_pct_smooth = zoo::rollmean(deviation_pct, k = 3, fill = NA, align = "center"),
above_baseline = ifelse(is.na(deviation_pct_smooth), NA, deviation_pct_smooth > 0)
) %>%
ungroup() %>%
filter(!is.na(deviation_pct_smooth)) %>%
mutate(content_type = fct_reorder(content_type, baseline_rate, .desc = TRUE)) %>%
group_by(content_type) %>%
group_modify(~ interpolate_at_zero(.x)) %>%
ungroup()
ggplot(chart_data, aes(x = year_month, y = deviation_pct_smooth, color = above_baseline, group = 1)) +
geom_hline(yintercept = 0, color = "grey40", linetype = "dashed") +
geom_line(linewidth = 1.2) +
facet_wrap(~content_type, ncol = 3) +
scale_color_manual(
values = c("TRUE" = wes_colors[5], "FALSE" = wes_colors[1]),
labels = c("Below Baseline", "Above Baseline"),
name = NULL
) +
scale_x_date(
date_labels = "'%y",
date_breaks = "2 years",
expand = expansion(mult = c(0.02, 0.02))
) +
scale_y_continuous(
labels = function(x) paste0(ifelse(x >= 0, "+", ""), round(x, 0), "%")
) +
labs(
title = "Censorship Patterns vs. Historical Average",
caption = "Note: Dashed line is the baseline average for that category. Red indicates months with above-average censorship rates; blue is below-average.\nA 3-month moving average has been applied to smooth trends. Source: Central Board of Film Certification.",
x = NULL,
y = "Deviation from Baseline"
) +
theme_cbfc() +
theme(
plot.caption = element_text(size = 9, color = "grey50", hjust = 0, margin = margin(t = 16)),
panel.grid.major = element_line(color = "grey90", linewidth = 0.4),
legend.position = "bottom"
)

This histogram shows the distribution of total modification times for films that had at least one cut. Approximately half of the films in the dataset have ‘zero seconds’ of edits (modifications may have been recorded, but without a logged duration); this chart focuses only on films with positive durations.
film_data <- data %>%
group_by(id) %>%
summarise(
total_modified_time_secs = sum(total_modified_time_secs, na.rm = TRUE),
.groups = 'drop'
)
total_films <- nrow(film_data)
percent_zero <- mean(film_data$total_modified_time_secs == 0, na.rm = TRUE)
film_data_positive <- film_data %>% filter(total_modified_time_secs > 0)
median_val_positive <- median(film_data_positive$total_modified_time_secs)
p_dist <- ggplot(film_data_positive, aes(x = total_modified_time_secs)) +
geom_histogram(
aes(y = after_stat(count) / total_films * 100),
fill = wes_colors[2],
bins = 40,
boundary = 0
) +
geom_vline(
xintercept = median_val_positive,
linetype = "dashed",
color = "gray20",
linewidth = 0.8
) +
# Use a log10 scale on the x-axis because the data is highly skewed.
scale_x_log10(
breaks = c(1, 10, 60, 300, 1800),
labels = c("1s", "10s", "1m", "5m", "30m")
) +
scale_y_continuous(
expand = expansion(mult = c(0, 0.05)),
labels = percent_format(scale = 1)
) +
labs(
title = "Distribution of Modification Times",
subtitle = str_glue("Comparing total modification time for films with at least one cut.
This chart excludes the {percent(percent_zero, accuracy=1)} of films with zero modifications."),
x = "Total Modification Time (Logarithmic Scale)",
y = "Percent of All Films"
) +
theme_cbfc() +
theme(
panel.grid.major.x = element_blank()
)
print(p_dist)

# Export the data used for the histogram to a JSON file.
# We use ggplot_build() to extract the computed data from the plot object.
hist_data_for_export <- ggplot_build(p_dist)$data[[1]] %>%
select(x_min = xmin, x_max = xmax, count, y_percent_total = y)
export_data_dist <- list(
histogram_bins = hist_data_for_export,
statistics = list(
median_positive_secs = median_val_positive,
total_films = total_films,
percent_zero_mods = percent_zero
),
axis_config = list(
breaks = c(1, 10, 60, 300, 1800),
labels = c("1s", "10s", "1m", "5m", "30m")
)
)
write_json(export_data_dist, "histogram_data.json", pretty = TRUE, auto_unbox = TRUE)

To determine which censorship categories are increasing or decreasing over time, we use a logistic regression model.
A positive slope from the model indicates an increasing trend, while a negative slope indicates a decreasing trend. We then visualize the top three fastest-increasing and fastest-decreasing categories since 2018.
This is based on Julia Silge’s excellent article demonstrating a similar type of analysis.
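Since quarter enters the model as a date, the fitted slope is the change in log-odds per day. A toy run of the same model specification on made-up counts (hypothetical data, not the CBFC dataset):
library(broom)
toy <- tibble::tibble(
  quarter = as.Date(c("2018-01-01", "2018-04-01", "2018-07-01", "2018-10-01")),
  category_count = c(10, 15, 20, 25),
  quarter_total_cuts = rep(100, 4)
)
fit <- glm(
  cbind(category_count, quarter_total_cuts - category_count) ~ quarter,
  data = toy, family = "binomial"
)
tidy(fit)  # a positive 'quarter' estimate = this category's share of cuts is rising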
categories_by_quarter <- data %>%
filter(!is.na(cert_date), !is.na(ai_content_types), cert_date >= as.Date("2018-01-01")) %>%
mutate(quarter = floor_date(cert_date, unit = "3 months")) %>%
count(quarter, ai_content_types, name = "category_count") %>%
group_by(quarter) %>%
mutate(quarter_total_cuts = sum(category_count)) %>%
ungroup() %>%
group_by(ai_content_types) %>%
filter(sum(category_count) > 100) %>%
ungroup()
category_models <- categories_by_quarter %>%
nest(data = -ai_content_types) %>%
# Fit a binomial GLM per category: successes = category_count, failures =
# (quarter_total_cuts - category_count), as a function of time (quarter).
mutate(
model = map(data, ~ glm(
cbind(category_count, quarter_total_cuts - category_count) ~ quarter,
data = .,
family = "binomial"
))
)
slopes <- category_models %>%
mutate(tidied = map(model, tidy)) %>%
unnest(tidied) %>%
filter(term == "quarter") %>%
arrange(desc(estimate))
trends_to_plot <- bind_rows(
slopes %>% top_n(3, estimate),
slopes %>% top_n(-3, estimate)
)
ggplot(
categories_by_quarter %>% inner_join(trends_to_plot, by = "ai_content_types"),
aes(x = quarter, y = category_count / quarter_total_cuts,
color = fct_reorder2(ai_content_types, quarter, category_count))
) +
geom_line(alpha = 0.8, linewidth = 1.2) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", linewidth = 0.5) +
facet_wrap(~ ifelse(estimate > 0, "Increasing Trends", "Decreasing Trends"), scales = "free_y") +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = wes_colors) +
labs(
title = "Which Censorship Categories Have Trended Up or Down?",
subtitle = "Proportion of all modifications per quarter since 2018, with a linear trendline.",
x = "Date of Certification",
y = "Percent of All Cuts in Quarter",
color = "Content Type"
) +
theme_cbfc() +
theme(legend.position = "bottom")

# Export trending categories data
# Create clean version of trends_to_plot without model objects
trends_clean <- trends_to_plot %>%
select(ai_content_types, estimate, std.error, statistic, p.value)
trending_export <- list(
slopes = slopes %>% select(ai_content_types, estimate, std.error, statistic, p.value),
quarterly_data = categories_by_quarter %>%
inner_join(trends_clean, by = "ai_content_types") %>%
select(quarter, ai_content_types, category_count, quarter_total_cuts,
estimate, std.error, statistic, p.value)
)
write_json(trending_export, "trending_categories.json", pretty = TRUE, auto_unbox = TRUE)

This analysis explores which terms tend to appear together within the same film’s modification records.
We calculate the pairwise correlation between the 50 most common keywords and visualize the results as a network graph.
This contains a lot of NSFW stuff, but hiding it would…probably defeat the point.
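pairwise_cor() from the widyr package computes the phi coefficient between every pair of items that share a feature; here, words appearing in the same film id. A tiny illustration on made-up data:
library(widyr)
demo <- tibble::tibble(
  id = c(1, 1, 2, 2, 3),
  word = c("blood", "fight", "blood", "fight", "kiss")
)
demo %>% pairwise_cor(item = word, feature = id, sort = TRUE)
# "blood" and "fight" appear together in every film -> correlation +1;
# each of them versus "kiss" -> correlation -1.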
reference_words <- data %>%
filter(!is.na(ai_reference)) %>%
mutate(word = str_split(tolower(ai_reference), "\\|")) %>%
unnest(word) %>%
filter(!is.na(word), !str_detect(word, "violence|scene|visual|dialogue")) %>%
select(id, word)
word_counts <- reference_words %>%
count(word, sort = TRUE)
word_pairs <- reference_words %>%
filter(word %in% (word_counts %>% top_n(50, n) %>% pull(word))) %>%
pairwise_cor(item = word, feature = id, sort = TRUE)
word_pairs %>%
filter(correlation > 0.05) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = wes_colors[3], size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void() +
labs(
title = "Which Censorship Terms Appear Together?",
subtitle = "Co-occurrence network of the 50 most common keywords in CBFC modification records."
)

While the network graph shows which words co-occur, it doesn’t tell us which words are most characteristic of a specific censorship category. For that, we use a metric called Term Frequency-Inverse Document Frequency (TF-IDF).
TF-IDF identifies words that are common within one category (e.g., “blood” in the “violence” category) but are relatively rare in all other categories.
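A quick illustration of how bind_tf_idf() behaves on toy counts (hypothetical numbers): a word unique to one of two categories gets idf = ln(2/1) ≈ 0.69, while a word present in both gets idf = ln(2/2) = 0 and therefore a TF-IDF of zero.
library(tidytext)
toy <- tibble::tibble(
  category = c("violence", "violence", "profanity"),
  word = c("blood", "scene", "scene"),
  n = c(20, 5, 10)
)
toy %>% bind_tf_idf(term = word, document = category, n = n)
# "blood" (violence only): tf = 20/25 = 0.8, idf = ln(2) -> tf_idf ~ 0.55
# "scene" (both categories): idf = 0 -> tf_idf = 0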
reference_tokens <- data %>%
filter(!is.na(ai_reference), !is.na(ai_content_types)) %>%
select(ai_content_types, ai_reference) %>%
mutate(word = str_split(tolower(ai_reference), "\\|")) %>%
unnest(word) %>%
filter(word != "")
reference_tfidf <- reference_tokens %>%
count(ai_content_types, word, sort = TRUE) %>%
bind_tf_idf(term = word, document = ai_content_types, n = n) %>%
arrange(desc(tf_idf))
top_terms_table <- reference_tfidf %>%
group_by(ai_content_types) %>%
slice_max(order_by = tf_idf, n = 5) %>%
ungroup() %>%
select(Category = ai_content_types, Term = word, `TF-IDF` = tf_idf) %>%
filter(Category %in% c("violence", "profanity", "sexual_suggestive", "substance", "political", "religious"))
top_terms_table %>%
gt() %>%
tab_header(
title = "Most Distinctive Keywords by Censorship Category",
subtitle = "Using Term Frequency-Inverse Document Frequency (TF-IDF) analysis"
) %>%
fmt_number(
columns = `TF-IDF`,
decimals = 3
) %>%
tab_style(
style = list(
cell_fill(color = wes_colors[2], alpha = 0.3),
cell_text(weight = "bold")
),
locations = cells_column_labels()
) %>%
tab_style(
style = cell_text(transform = "capitalize"),
locations = cells_body(columns = Category)
) %>%
tab_style(
style = cell_text(style = "italic"),
locations = cells_body(columns = Term)
) %>%
cols_align(
align = "center",
columns = `TF-IDF`
) %>%
tab_options(
table.font.names = "Atkinson Hyperlegible",
heading.title.font.size = 16,
heading.subtitle.font.size = 12,
column_labels.font.size = 12,
table.font.size = 11,
row_group.font.weight = "bold"
)

Most Distinctive Keywords by Censorship Category
Using Term Frequency-Inverse Document Frequency (TF-IDF) analysis
| Category | Term | TF-IDF |
|---|---|---|
| political | national flag | 0.020 |
| political | modi | 0.015 |
| political | indian flag | 0.009 |
| political | pakistan | 0.007 |
| political | political leaders | 0.007 |
| profanity | munda | 0.005 |
| profanity | bhenchod | 0.004 |
| profanity | iththa | 0.003 |
| profanity | bhadkov | 0.003 |
| profanity | maal | 0.003 |
| religious | superstition | 0.044 |
| religious | black magic | 0.015 |
| religious | religion | 0.014 |
| religious | superstitions | 0.012 |
| religious | religious sentiments | 0.011 |
| sexual_suggestive | kissing scene | 0.022 |
| sexual_suggestive | cleavage | 0.019 |
| sexual_suggestive | sexual_suggestive | 0.018 |
| sexual_suggestive | kissing | 0.017 |
| sexual_suggestive | love making scene | 0.017 |
| substance | tobacco | 0.028 |
| substance | liquor label | 0.017 |
| substance | akshay kumar | 0.016 |
| substance | liquor labels | 0.015 |
| substance | rahul dravid | 0.014 |
| violence | blood | 0.032 |
| violence | bloodshed | 0.009 |
| violence | killing scene | 0.007 |
| violence | dead bodies | 0.007 |
| violence | dead body | 0.006 |
This final table identifies the films that had the most time removed by each major CBFC regional office.
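The code below uses three helpers from our setup: extract_year(), clean_name(), and make_slug(). Rough sketches of what they might do, inferred purely from how they are called (assumptions, not the original implementations):
# Hypothetical sketches, inferred from usage below.
extract_year <- function(cert_date, cert_no) {
  # Presumably returns the certification year, perhaps falling back to
  # parsing the certificate number when the date is missing.
  as.numeric(format(cert_date, "%Y"))
}
clean_name <- function(name) {
  # Presumably tidies a movie title, e.g. trimming and collapsing whitespace.
  stringr::str_squish(name)
}
make_slug <- function(name, year) {
  # Presumably builds a URL slug like "ghajini-2008".
  slug <- tolower(gsub("[^A-Za-z0-9]+", "-", name))
  slug <- gsub("^-+|-+$", "", slug)
  paste(slug, year, sep = "-")
}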
film_level_summary <- data %>%
filter(!is.na(office) & !str_detect(office, "\\.mp4$")) %>%
group_by(id, movie_name, office, language, cert_date, cert_no) %>%
summarise(
total_cuts = n(),
total_time_removed_secs = sum(total_modified_time_secs, na.rm = TRUE),
.groups = 'drop'
) %>%
filter(total_time_removed_secs > 0, total_time_removed_secs < 1800) %>%
mutate(
year = map2_dbl(cert_date, cert_no, ~ extract_year(.x, .y)),
cleaned_name = map_chr(movie_name, clean_name),
slug = map2_chr(cleaned_name, year, make_slug)
)
top_censored_by_office <- film_level_summary %>%
group_by(office) %>%
arrange(desc(total_time_removed_secs)) %>%
slice_head(n = 10) %>%
ungroup() %>%
mutate(
duration_formatted = format_seconds(total_time_removed_secs)
) %>%
arrange(office, desc(total_time_removed_secs)) %>%
select(Office = office, Language = language, `Film Name` = cleaned_name,
`Time Removed` = duration_formatted, `Total Cuts` = total_cuts, Slug = slug)
display_table <- top_censored_by_office %>%
select(-Slug)
display_table %>%
gt() %>%
tab_header(
title = "Top 10 Most Censored Films by Time Removed",
subtitle = "Films with the highest modification times for each regional office"
) %>%
tab_style(
style = list(
cell_fill(color = wes_colors[3], alpha = 0.3),
cell_text(weight = "bold")
),
locations = cells_column_labels()
) %>%
tab_style(
style = cell_text(style = "italic"),
locations = cells_body(columns = `Film Name`)
) %>%
cols_align(
align = "center",
columns = c(Language, `Time Removed`, `Total Cuts`)
) %>%
cols_align(
align = "left",
columns = c(Office, `Film Name`)
) %>%
tab_options(
table.font.names = "Atkinson Hyperlegible",
heading.title.font.size = 16,
heading.subtitle.font.size = 12,
column_labels.font.size = 12,
table.font.size = 11
) %>%
tab_style(
style = cell_borders(
sides = c("top", "bottom"),
color = "lightgray",
weight = px(1)
),
locations = cells_body()
)

Top 10 Most Censored Films by Time Removed
Films with the highest modification times for each regional office
| Office | Language | Film Name | Time Removed | Total Cuts |
|---|---|---|---|---|
| Bangalore | Kannada | NEGILA ODEYA | 26m 22s | 5 |
| Bangalore | Tamil | ROCKY | 24m 16s | 6 |
| Bangalore | Kannada | DAMAYANTHI | 23m 21s | 13 |
| Bangalore | Tamil | GAJAKESSARI | 23m 05s | 8 |
| Bangalore | Hindi | REHNA NAHI BIN TERE | 23m 02s | 1 |
| Bangalore | Tamil | GOOGLY | 22m 03s | 6 |
| Bangalore | Hindi | KISS | 21m 36s | 8 |
| Bangalore | Kannada | M E S T R I (REVISED) | 20m 52s | 23 |
| Bangalore | Tamil | RUSTUM | 20m 19s | 6 |
| Bangalore | Kannada | TIGER GALLI | 19m 45s | 188 |
| Chennai | Telugu | PULIDEBBA (REVISED) | 28m 25s | 18 |
| Chennai | Tamil | PATCHIRAJA PARAVAI | 26m 42s | 4 |
| Chennai | Tamil | 'AALAPIRANTHAVAN' (REVISED) | 25m 23s | 27 |
| Chennai | Tamil | NERAM NALLA NERAM (REVISED) | 25m 01s | 29 |
| Chennai | Tamil | NAAN UNNA NINACHEN (REVISED) | 24m 34s | 28 |
| Chennai | Tamil | THEEVIRAM | 23m 43s | 10 |
| Chennai | Hindi | GHAJINI | 23m 27s | 7 |
| Chennai | Hindi | AANGILA PADAM | 22m 57s | 6 |
| Chennai | Telugu | BANDIPOTU SIMHAM (REVISED) | 22m 47s | 22 |
| Chennai | Telugu | PAGA PATTINA SIMHAM (REVISED) | 22m 28s | 18 |
| Cuttack | Bhojpuri | IDDARAMMAYILATHO | 28m 53s | 7 |
| Cuttack | Odia | TOR MOR LOVE STORY | 27m 57s | 10 |
| Cuttack | Awadhi | VETTAIKAARAN | 26m 25s | 6 |
| Cuttack | Odia | MU LOVER NO - 1 | 26m 10s | 11 |
| Cuttack | Bhojpuri | V I P 2 | 25m 56s | 6 |
| Cuttack | Odia | PREMA TOTE DURU JUHARA | 25m 54s | 7 |
| Cuttack | Odia | JHIATE TO PARI | 25m 43s | 6 |
| Cuttack | Odia | DASAHARA | 23m 23s | 7 |
| Cuttack | Bhojpuri | MISHRA FOR SALE | 23m 12s | 7 |
| Cuttack | Bhojpuri | SOWKHYAM | 23m 02s | 6 |
| Delhi | Hindustani | LAKSHMI | 28m 11s | 6 |
| Delhi | Hindustani | ROWDY MUNNA | 27m 39s | 7 |
| Delhi | Bhojpuri | KHALEJA (MAHESH KHALEJA) | 27m 22s | 7 |
| Delhi | Odia | CHANDRAMUKHI | 27m 00s | 7 |
| Delhi | Hindustani | D J | 26m 50s | 6 |
| Delhi | Hindustani | SUPER | 26m 35s | 7 |
| Delhi | Awadhi | POWER RETURNS ( RACE GURRAM ) | 24m 47s | 5 |
| Delhi | Hindustani | RACHHA | 23m 30s | 5 |
| Delhi | Odia | DHARMADURAI | 22m 31s | 8 |
| Delhi | Odia | TARAK | 22m 28s | 8 |
| Guwahati | Hindi | CAPITAL | 3m 00s | 5 |
| Guwahati | Manipuri | NGAMNABA LANFAMSE | 2m 45s | 8 |
| Guwahati | Bodo | BEKAR ROMEO | 1m 38s | 6 |
| Guwahati | Hindi Dubbed | BARUN RAI AND THE HOUSE ON THE CLIFF | 1m 34s | 10 |
| Guwahati | Assamese | RAKSHAK - THE SAVIOUR | 1m 08s | 3 |
| Guwahati | Assamese | BAD BOYS | 0m 50s | 2 |
| Guwahati | Assamese | RONGATAPU 1982 | 0m 32s | 3 |
| Guwahati | Khasi | KYNJAH | 0m 27s | 7 |
| Guwahati | Hindi | KOOKI | 0m 24s | 6 |
| Guwahati | Assamese | MOI ETI NIXHASOR (KODUWA - THE NIGHTBIRD) | 0m 18s | 1 |
| Hyderabad | Telugu | "NARAKASURA" | 25m 00s | 15 |
| Hyderabad | Telugu | NANI | 23m 51s | 40 |
| Hyderabad | Telugu | ALLUDA MAJAAKAA | 23m 11s | 38 |
| Hyderabad | Malayalam | SIMHA MUGHAM | 20m 46s | 27 |
| Hyderabad | Telugu | "I LOVE YOU IDIOT" | 18m 54s | 1 |
| Hyderabad | Hindi | SAMAJAVARAGAMANA | 18m 39s | 6 |
| Hyderabad | Telugu | AASHAADAM PELLIKODUKU ( A TO UA) | 17m 03s | 19 |
| Hyderabad | Hindi | KONAPURAMLO JARIGINA KATHA | 16m 48s | 7 |
| Hyderabad | Hindi | "SPARK LIFE" | 16m 33s | 16 |
| Hyderabad | Telugu | 1948 AKHANDA BHARATH | 16m 25s | 19 |
| Kolkata | Bengali | BAGHER BACCHA | 28m 24s | 8 |
| Kolkata | Bengali | DESHBHOKTO KHILADI | 27m 57s | 8 |
| Kolkata | Bengali | PYAR KI JUNG | 21m 39s | 6 |
| Kolkata | Bengali | VANDE BHARAT (SAVE INDIA) | 16m 15s | 4 |
| Kolkata | Bengali | MAAHIYA (REVISED) | 13m 23s | 11 |
| Kolkata | Bengali | MAHANAYAK UTTAM KUMAR ( THE METRO STATION ) ( REVISED) | 11m 43s | 9 |
| Kolkata | Bengali | PORNOMOCHI (REVISED) | 10m 24s | 44 |
| Kolkata | Bengali | LAAL RANGER DUNIYA (REVISED) | 9m 36s | 65 |
| Kolkata | Bengali | KOLIJUGER SHIKHONDI | 9m 23s | 4 |
| Kolkata | Bengali | CIRCLE (REVISED) | 9m 19s | 21 |
| Mumbai | Hindustani | MASS | 29m 52s | 7 |
| Mumbai | Bhojpuri | DAROGA BABUNI | 29m 50s | 35 |
| Mumbai | Bhojpuri | JIYAB SHAAN SE | 29m 46s | 6 |
| Mumbai | Hindustani | DON ( DON SEENU) | 29m 44s | 6 |
| Mumbai | Urdu | PALAHATI BRAHMA NAIDU - SHEZADA | 29m 43s | 6 |
| Mumbai | Hindustani | NAKSHATRAM | 29m 43s | 1 |
| Mumbai | Bengali | SAPTAMA INDRIYA | 29m 38s | 6 |
| Mumbai | Hindi | HAI | 29m 37s | 9 |
| Mumbai | Hindustani | AATAADISTHA | 29m 35s | 7 |
| Mumbai | Hindi | DARINGBAAZ KHILADI RETURNS (THOOTTAL POO MALARUM) | 29m 24s | 8 |
| Thiruvananthpuram | Kannada | MARCO | 29m 32s | 2 |
| Thiruvananthpuram | Malayalam | THANKAMANI | 26m 05s | 8 |
| Thiruvananthpuram | Malayalam | ARDHARAATHRI (REVISED) | 25m 10s | 21 |
| Thiruvananthpuram | Malayalam | ARTHARAATHRI PANTHRANDU MUTHAL AARU VARE | 25m 07s | 4 |
| Thiruvananthpuram | Malayalam | PANI | 24m 20s | 5 |
| Thiruvananthpuram | Malayalam | ATTENTION PLEASE | 22m 28s | 3 |
| Thiruvananthpuram | Malayalam | VIRAL 2020 | 22m 24s | 2 |
| Thiruvananthpuram | Malayalam | UDAL | 21m 39s | 69 |
| Thiruvananthpuram | Malayalam | KARIMPULI (RE-REVISED) | 21m 32s | 17 |
| Thiruvananthpuram | Malayalam | ARIKU | 20m 10s | 4 |
For more details, interactive charts, and the full explorer, please visit our project website: https://cbfc.watch
The analysis was conducted by Aman Bhargava and Vivek Matthew for Diagram Chasing.
Bhargava, A., Matthew, V., & Diagram Chasing. (2025). Analyzing Film Censorship in India: CBFC Watch. Retrieved from https://cbfc.watch.
BibTeX Entry
For use in LaTeX documents, you can use the following BibTeX entry:
@misc{Bhargava2025CBFC,
author = {Bhargava, Aman and Matthew, Vivek and {Diagram Chasing}},
title = {Analyzing Film Censorship in India: CBFC Watch},
year = {2025},
month = {September},
howpublished = {\url{https://cbfc.watch}},
note = {Analysis and visualizations by Diagram Chasing. Last accessed: \today}
}