
over 1 year ago

Tracks and Datasets

The theme for the 2024 CSUEB Datathon is Sustainability! Students will work in teams to create data science projects and visualizations using datasets surrounding this theme.

About

Teams will get the chance to work on a project focusing on one of three tracks:

  • AgTech
  • Finance
  • EdTech

Each team will choose one track to work with!

Each track has one accompanying dataset to explore. The datasets vary in difficulty, and each offers its own challenges and insights. Explore all of the datasets to see which one interests you most.

 

Datasets

1. AgTech

Dataset: Plant Varieties, Soil, and Agricultural Resources Database

Download: AgTech_Dataset

Link for Info: Global Farmers Market

Description: The AgTech track invites participants to develop a machine learning model that predicts the ideal crop variety based on agricultural features. Using the provided database, supplemented with publicly available data, the model should analyze factors like soil type, pH range, salinity levels, water requirements, and organic matter content. The goal is to provide actionable recommendations tailored to specific farm profiles, enhancing crop yield and resource efficiency.

Objective: To develop user-friendly tools and dashboards that support farm decision-making with actionable insights on crop selection and soil management.

Specific Goals:

  • Database Exploration: Analyze the provided agricultural database to understand key features like soil properties, nutrients, and plant characteristics.
  • Feature Engineering: Prepare the dataset by creating meaningful features for model input, such as nutrient availability, soil moisture, and environmental conditions.
  • Model Development: Build a predictive machine learning model to recommend the most suitable crop variety for given conditions.
  • Evaluation Metrics: Evaluate the model using metrics like accuracy, precision, recall, and F1-score to ensure reliable predictions.
  • Dashboard Creation: Develop a dashboard for farmers to input conditions (e.g., soil type, pH, climate) and receive instant crop recommendations.
  • Actionable Insights: Provide explanations for predictions, highlighting factors influencing the recommended crop variety.

TechStack Scope: Python, SQL, Machine Learning, Tableau/PowerBI, Figma
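
The evaluation metrics named above can be computed by hand to see what they measure. The following is a minimal sketch using only the Python standard library; the crop labels and the function name macro_report are hypothetical and are not taken from the AgTech dataset. It assumes macro averaging, where each crop variety counts equally regardless of how often it appears.

```python
from collections import Counter


def macro_report(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # this crop was predicted, but wrongly
            fn[truth] += 1  # the true crop was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical true and predicted crop varieties for six farm profiles.
y_true = ["wheat", "rice", "maize", "rice", "wheat", "maize"]
y_pred = ["wheat", "rice", "rice", "rice", "wheat", "maize"]
report = macro_report(y_true, y_pred)
print(report)
```

In a real project a library such as scikit-learn would normally handle this; the hand-written version is only meant to make the definitions concrete.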

 

2. Finance

Dataset: Market Sales and Financial Performance Data (To be fetched by Participants)

Download: Fintech_Dataset

Link for Info: https://www.sec.gov/search-filings/standard-industrial-classification-sic-code-list

Description: This track emphasizes leveraging financial and market data to develop intuitive, AI-assisted business valuation tools. The dataset includes U.S. companies, their respective Ticker IDs, and Standard Industrial Classification (SIC) codes and their descriptions.
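
One possible first step with this dataset is grouping companies by SIC code so that each industry's peers can be compared. The sketch below assumes hypothetical column names (ticker, sic, sic_description) and made-up tickers; the actual file may be laid out differently.

```python
from collections import defaultdict

# Hypothetical rows; the real dataset's column names and values may differ.
companies = [
    {"ticker": "AAA", "sic": "7372", "sic_description": "Services-Prepackaged Software"},
    {"ticker": "BBB", "sic": "6022", "sic_description": "State Commercial Banks"},
    {"ticker": "CCC", "sic": "7372", "sic_description": "Services-Prepackaged Software"},
]


def group_by_sic(rows):
    """Map each SIC code to the list of tickers classified under it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["sic"]].append(row["ticker"])
    return dict(groups)


peers = group_by_sic(companies)
for sic, tickers in sorted(peers.items()):
    print(sic, tickers)
```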

Objective: Participants will source publicly available datasets, such as market sales transactions and industry-specific benchmarks, linked to SIC codes and Ticker IDs. The goal is to design user-friendly tools and dashboards that enable advanced data aggregation, financial analysis, and valuation methodologies. These solutions should simplify complex financial concepts for non-experts, offering clear, actionable insights and fostering a deeper understanding of business valuations.

Specific Goals:

  • Data Sourcing and Aggregation: Collect and consolidate financial data using Ticker IDs and SIC codes from publicly accessible resources.
  • Market Sales Analysis: Examine market sales transactions to identify patterns, outliers, and variations across industries, locations, and company sizes.
  • Automated Data Aggregation: Streamline the collection of market sales data using SIC codes with filters for company size, geographic location, and transaction timelines.
  • AI-Driven Financial Guidance: Create tools that allow users to input financial data and generate AI-powered business valuation insights, including Discounted Cash Flow (DCF) analysis.
  • Intuitive Dashboards: Design user-friendly dashboards to display market sales trends, industry standards, and valuation metrics specific to various industries or geographies.
  • Benchmarking and Insights: Compare consolidated data with industry benchmarks to identify gaps, opportunities, and actionable recommendations.

TechStack Scope: Python, Tableau/PowerBI, Figma, LLMs
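
For teams new to Discounted Cash Flow analysis, here is a minimal sketch of one common textbook form. It discounts each projected cash flow back to the present and adds a terminal value from the Gordon growth formula. The cash flows, discount rate, and growth rate are illustrative assumptions, not values from any dataset, and real valuations involve many more judgment calls.

```python
def discounted_cash_flow(cash_flows, discount_rate, terminal_growth):
    """Present value of yearly cash flows plus a Gordon-growth terminal value.

    cash_flows: projected cash flows for years 1..n
    discount_rate: required rate of return (e.g. 0.10 for 10%)
    terminal_growth: assumed perpetual growth rate after year n
    """
    if not cash_flows:
        raise ValueError("at least one cash flow is required")
    if discount_rate <= terminal_growth:
        raise ValueError("discount_rate must exceed terminal_growth")

    # Discount each projected cash flow back to today.
    pv_flows = sum(
        cf / (1 + discount_rate) ** year
        for year, cf in enumerate(cash_flows, start=1)
    )

    # Value of all cash flows after the final projected year, as of that year.
    terminal_value = (
        cash_flows[-1] * (1 + terminal_growth) / (discount_rate - terminal_growth)
    )
    pv_terminal = terminal_value / (1 + discount_rate) ** len(cash_flows)

    return pv_flows + pv_terminal


# Illustrative inputs only.
value = discounted_cash_flow([100.0, 110.0, 121.0], discount_rate=0.10, terminal_growth=0.02)
print(round(value, 2))
```

The result is very sensitive to the discount and growth rates, so a dashboard built on this idea would likely let users adjust both.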

 

3. EdTech

Dataset: Educational Accessibility, Quality, and Outcomes in the U.S. (2015-2019)

Download: EdTech Dataset

Link for Info: UN Sustainable Development Goal 4 - Quality Education

Description: This dataset focuses on the UN Sustainable Development Goal 4 (Quality Education) and provides data on educational accessibility, quality, and outcomes across the U.S. It includes indicators such as literacy rates, enrollment and completion rates, and educational disparities across various regions and socio-economic groups. This dataset aims to support data-driven decision-making to enhance educational equality.

Outcomes: Participants are expected to thoroughly understand the dataset structure, analyze the data to identify trends, gaps, and correlations, and create visually appealing and insightful dashboards or visualizations. They should interpret the data to provide clear explanations of trends, disparities, and outcomes, effectively communicating their findings to highlight key insights and actionable recommendations.

Specific Goals:

  • Assessment of Educational Access: Understand the reach of free, equitable primary and secondary education and identify areas needing improvement.
  • Quality of Early Education: Evaluate early childhood development metrics to ensure children’s readiness for primary education.
  • Skill Development: Measure progress in technical and vocational training initiatives supporting employment.
  • Gender and Inclusivity Metrics: Identify gender disparities and assess inclusivity for vulnerable groups.

TechStack Scope: Python, SQL, R, Tableau/PowerBI, Figma, Microsoft Excel
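
One simple way to quantify a gender disparity is a parity index: the female value of an indicator divided by the male value, where 1.0 means parity. The sketch below uses hypothetical regions, column names, and completion rates; the EdTech dataset's actual fields may differ.

```python
# Hypothetical rows; real column names and values may differ.
rows = [
    {"region": "Region A", "female_completion": 0.90, "male_completion": 0.88},
    {"region": "Region B", "female_completion": 0.72, "male_completion": 0.80},
    {"region": "Region C", "female_completion": 0.85, "male_completion": 0.85},
]


def gender_parity_index(female_rate, male_rate):
    """Ratio of the female rate to the male rate; 1.0 indicates parity."""
    if male_rate == 0:
        raise ValueError("male_rate must be non-zero")
    return female_rate / male_rate


gpi = {
    row["region"]: gender_parity_index(row["female_completion"], row["male_completion"])
    for row in rows
}

# The region whose index is furthest from 1.0 has the widest gap.
widest_gap = max(gpi, key=lambda region: abs(gpi[region] - 1.0))

for region, index in gpi.items():
    print(f"{region}: {index:.3f}")
print("Widest gap:", widest_gap)
```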

 

Getting Started

Once you download your dataset, you can begin exploring it with a data processing tool. Depending on your experience and preferences, here are some options:

Microsoft Excel: Excel is a great tool for beginners to explore data, since it involves no coding and has a clean, easy-to-use graphical user interface. If you are new to data science, consider exploring and analyzing your data with this tool. Many universities provide Excel to students for free as part of Microsoft Office, so check whether yours does.

Tableau: Tableau is a powerful data analysis and visualization tool. It is also approachable, since it involves little to no coding and has a clean user interface. If you are relatively new to data science, or are an intermediate user who would like a tool to assist you with creating data visualizations, consider downloading Tableau! Tableau Public is available as a free download. Your university may also offer the paid version of Tableau to students; however, it may take longer to configure.

Python: Python is one of the most popular programming languages for data science. It can handle the full workflow, from data exploration to analysis to creating visualizations and dashboards. If you have taken a Python programming class before, we strongly encourage you to try using Python for your project!
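
As a first exploration step, a CSV file can be read and summarized with nothing but the standard library. This sketch uses a small inline sample with hypothetical column names so that it runs on its own; with a real dataset you would open the downloaded file instead.

```python
import csv
import io
import statistics

# Inline stand-in for a downloaded file; columns and values are hypothetical.
sample_csv = io.StringIO(
    "region,literacy_rate\n"
    "North,91.5\n"
    "South,84.0\n"
    "East,88.5\n"
)

# DictReader yields one dict per row, keyed by the header names.
rows = list(csv.DictReader(sample_csv))
rates = [float(row["literacy_rate"]) for row in rows]

summary = {
    "count": len(rates),
    "mean": statistics.mean(rates),
    "min": min(rates),
    "max": max(rates),
}
print(summary)
```

For larger datasets, many teams move to a library such as pandas, which offers the same kind of summary in a single call.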

SQL: SQL (Structured Query Language) is essential for managing and querying relational databases. It allows you to efficiently retrieve, manipulate, and analyze data stored in databases. SQL is great for intermediate to advanced users who need to handle large datasets and perform complex queries. Many data science tasks begin with data extraction from databases using SQL. If you're looking to deepen your data handling skills, learning SQL is a valuable step.
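
To try SQL without installing a database server, Python's built-in sqlite3 module can run queries against an in-memory database. The table name, columns, and rows below are hypothetical and only illustrate a GROUP BY aggregation.

```python
import sqlite3

# An in-memory database disappears when the connection is closed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crops (name TEXT, soil_type TEXT, min_ph REAL, max_ph REAL)"
)
conn.executemany(
    "INSERT INTO crops VALUES (?, ?, ?, ?)",
    [
        ("Crop A", "loam", 6.0, 7.0),
        ("Crop B", "clay", 5.5, 6.5),
        ("Crop C", "loam", 6.5, 7.5),
    ],
)

# Count crops and average the minimum pH for each soil type.
query = """
    SELECT soil_type, COUNT(*) AS n_crops, AVG(min_ph) AS avg_min_ph
    FROM crops
    GROUP BY soil_type
    ORDER BY soil_type
"""
results = conn.execute(query).fetchall()
conn.close()

for soil_type, n_crops, avg_min_ph in results:
    print(soil_type, n_crops, avg_min_ph)
```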

Figma: Figma is a powerful design tool primarily used for creating user interfaces and prototypes. However, it can also be leveraged for data visualization and presenting insights in a visually appealing manner. Figma allows for collaboration, making it easy to share and work on designs with others in real-time. If you are interested in creating sophisticated, interactive, and visually engaging presentations of your data insights, Figma is a great tool to explore.

 

Feel free to join the support/Q&A room to resolve your queries and doubts until 9:00PM today: Zoom

 

Happy Hacking,

Business Analytics Club