Bollywood films and gender inequality

Actor-actress age gaps, career and marriage, female-lead films, genres.
Data science
India

Introduction

The lead male actor (the “hero”) and the lead female actor (the “heroine”) play important roles in any film, and even more so in Indian films. In mainstream Indian films, the actor who plays the hero is often much older than the actor who plays the heroine, even though the characters they play are of comparable age. (I use “lead male actor” and “hero” interchangeably for the actor; the context should make the meaning clear.)

Does this age gap exist across decades and in various regions of Indian cinema? What are the differences between the careers of heroes and heroines? How does marriage affect them? In this project, I’ll delve into such issues and understand some aspects of gender inequalities in Indian cinema.

A deeper dive into Bollywood

In this project, I consider Bollywood and investigate age gaps, careers and other such aspects. First, I extracted information about Hindi films and their cast from 1940 through June 2023 from Wikipedia pages. For each Hindi film, I made an initial list of the film title, its release year and the main cast. I extracted this information from the yearly film list pages on Wikipedia. From this main cast list for each film, I identified the “hero” and the “heroine”. This requires inferring the gender of each cast member. I did this by looking at their Wikipedia pages if available (or by performing an automated search on DuckDuckGo if a Wikipedia page is unavailable), and comparing the number of occurrences of the words actor, he, him, his with the number of occurrences of actress, she, her.
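The word-counting step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `infer_gender`, the whole-word tokenization, and the "unknown" fallback for ties are assumptions; only the two word lists come from the description above.

```python
import re
from collections import Counter

# Word lists taken from the description above.
MALE_WORDS = {"actor", "he", "him", "his"}
FEMALE_WORDS = {"actress", "she", "her"}


def infer_gender(page_text: str) -> str:
    """Return 'male', 'female', or 'unknown' from gendered word counts."""
    # Count whole lowercase words so that "the" does not match "he".
    words = Counter(re.findall(r"[a-z]+", page_text.lower()))
    male = sum(words[w] for w in MALE_WORDS)
    female = sum(words[w] for w in FEMALE_WORDS)
    if male == female:
        # Tie (including no matches at all): assumed fallback, not from the source.
        return "unknown"
    return "male" if male > female else "female"
```

A tie is returned as "unknown" here so that such cast members can be flagged for manual review rather than silently assigned a gender.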

For a similar analysis of Telugu films, go here.

Identifying the hero and heroine of a film from such a main cast comes with some assumptions and issues. The main assumption I make by default is that the first male cast member is the hero and the first female cast member is the heroine. However, this raises the following issues.

1.) Many films have multiple heroes and heroines.

In such cases, I only assign one hero and one heroine. Further, the heroes and heroines can be “paired” in any combination. The cast order doesn’t inform us about which hero is paired with which heroine.

For example, दृश्यम 2 (Drishyam 2) from 2022 has its cast listed as Ajay Devgn, Akshaye Khanna, Tabu, Shriya Saran. Our method would identify Ajay Devgn as the hero and Tabu as the heroine, whereas in the film Ajay Devgn and Shriya Saran are paired together.

2.) Some films don’t have a hero or a heroine.

The assumption that each film necessarily has a hero and a heroine itself doesn’t hold true for a few films. For example, An Action Hero from 2022 has Ayushmann Khurrana and Jaideep Ahlawat as the main cast, with no heroine credited. Jalsa from 2022 has Vidya Balan and Shefali Shah in the lead roles, with no hero credited.

3.) What should the relationship be between the hero and the heroine?

Does the relationship necessarily have to be a romantic one? Or does importance to the film’s plot take precedence over a romantic pairing of the potential hero and the potential heroine?

Take Pink from 2016 for example. Amitabh Bachchan and Taapsee Pannu are (correctly) identified as the hero and the heroine, but they don’t share a romantic relationship.

My approach

The approach I took is the following. The first cast member is always the “lead”: it is mostly the hero, but sometimes the heroine. The other (the heroine in case the lead is the hero, otherwise the hero) is the next cast member of the opposite gender. In most cases, this covers both (a) the romantic hero-heroine pair, and (b) the non-romantic hero and heroine. Sometimes, the lead has a romantic interest in some other cast member who is not the identified heroine/hero. To tackle such ambiguities, I created two datasets: (a) a High+ confidence dataset, in which the hero and the heroine are unambiguously the first and second cast members, and (b) a Low+ confidence dataset, in which there is some ambiguity in the hero and the heroine because of the cast order. The High+ confidence dataset excludes multi-starrer movies, as identifying the hero (or the heroine) in them is difficult. While the overall statistics change very little between the datasets, you can still switch between the datasets here.
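A rough sketch of this pairing rule is given below. The function name `identify_pair`, the input layout (an ordered list of name and gender tuples), and the simplified confidence rule ("high" only when the pair occupies the first two cast positions) are assumptions for illustration; the actual criteria for the two datasets, such as how multi-starrers are detected, may well be more involved.

```python
from typing import Optional


def identify_pair(cast: list[tuple[str, str]]) -> Optional[dict]:
    """cast: ordered (name, gender) tuples with gender 'male' or 'female'.

    Returns None when no hero-heroine pair can be formed.
    """
    if not cast or cast[0][1] not in ("male", "female"):
        return None
    lead_name, lead_gender = cast[0]
    wanted = "female" if lead_gender == "male" else "male"
    # Position of the next cast member of the opposite gender, if any.
    other_idx = next(
        (i for i, (_, gender) in enumerate(cast) if i > 0 and gender == wanted),
        None,
    )
    if other_idx is None:
        return None  # no hero or no heroine credited
    other_name = cast[other_idx][0]
    if lead_gender == "male":
        hero, heroine = lead_name, other_name
    else:
        hero, heroine = other_name, lead_name
    return {
        "hero": hero,
        "heroine": heroine,
        "female_lead": lead_gender == "female",
        # Simplified assumption: unambiguous only if the pair is cast[0], cast[1].
        "confidence": "high" if other_idx == 1 else "low",
    }
```

With the Drishyam 2 cast order quoted earlier, this sketch returns Ajay Devgn and Tabu with "low" confidence, and it returns no pair at all for An Action Hero or Jalsa.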

Ages and age gaps over the years

Once the hero and heroine were identified, I extracted their birth years (along with their wedding years and place of birth if available) from their Wikipedia pages and from searches on DuckDuckGo and Google. An automated process that relies on DuckDuckGo search to extract this information is riddled with inaccuracies as (a) often there are many people who share similar names, or (b) the cast members are only credited with a short name/nickname that is ambiguous, or (c) the birth year on the internet is not reliable. Hence, I mainly relied on Wikipedia, but fell back to DuckDuckGo + manual curation for a handful of heroes and heroines who have starred in multiple films.

Using the birth years, I calculated the ages of the hero and the heroine for each movie.
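As a minimal sketch, assuming age is taken as release year minus birth year (only years are available, so it can be off by one) and the age gap as hero age minus heroine age, the calculation could look like this. The function names are hypothetical.

```python
from typing import Optional


def age_at_release(release_year: int, birth_year: Optional[int]) -> Optional[int]:
    """Approximate age in the release year; None if the birth year is unknown."""
    if birth_year is None:
        return None
    return release_year - birth_year


def age_gap(
    release_year: int,
    hero_birth_year: Optional[int],
    heroine_birth_year: Optional[int],
) -> Optional[int]:
    """Hero age minus heroine age; None if either birth year is missing."""
    hero_age = age_at_release(release_year, hero_birth_year)
    heroine_age = age_at_release(release_year, heroine_birth_year)
    if hero_age is None or heroine_age is None:
        return None
    return hero_age - heroine_age
```

For Pink (2016), with Amitabh Bachchan born in 1942 and Taapsee Pannu born in 1987, `age_gap(2016, 1942, 1987)` gives ages of 74 and 29 and a gap of 45 years.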


Because of the issues I mentioned earlier, not all movies have both the hero’s age and the heroine’s age available. Here’s their availability over the years.

Hero grid: age-gaps for each hero

To make the visualization of this age-gap data easier, I made individual plots for each major hero/heroine showing the age-gap over their careers. Click on a hero to view his career age-gap trajectory:

Heroine grid: age-gaps for each heroine

Click on a heroine to view her career age-gap trajectory:

Debuts and exits

Careers and marriages

I only considered the year of first marriage in case an actor was married multiple times. Also, not knowing the marriage year doesn’t automatically mean that the actor is unmarried. So it is difficult to consider the careers of unmarried actors.

Female-lead films and genres

For each year, I also checked what fraction of films credit a woman as the first cast member, meaning the heroine is credited before the hero. I’ve taken this as a sign that the film is a female-lead one.
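The yearly fraction can be sketched as below, assuming each film is reduced to a hypothetical (release year, gender of first cast member) pair; the function name and the input layout are illustrative only.

```python
from collections import defaultdict
from typing import Iterable


def female_lead_fraction(films: Iterable[tuple[int, str]]) -> dict[int, float]:
    """films: (release_year, gender_of_first_cast_member) pairs.

    Returns, per year, the fraction of films whose first cast member is female.
    """
    totals: dict[int, int] = defaultdict(int)
    female: dict[int, int] = defaultdict(int)
    for year, first_gender in films:
        totals[year] += 1
        if first_gender == "female":
            female[year] += 1
    return {year: female[year] / totals[year] for year in sorted(totals)}
```

Films whose first cast member has an unknown gender still count towards the yearly total here; whether to drop them instead is a design choice the description above leaves open.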

Actors as heroes/heroines vs as cast members