In the early days, when I started learning about data and data science, I was confused about what a dataset actually is.
Sometimes it was hard to go through a dataset or to research one, and more often than not I still lacked clarity. I started searching and experimenting, and once tried working with a dataset in the R programming language. I struggled and paused midway. But I still have the mindset and curiosity to learn.
I started asking myself a few questions.
How to understand the dataset?
How to clean it?
How to go through a large dataset?
How was the dataset formed/created?
How to identify the pitfalls and major mistakes?
I had completely misunderstood: I thought a dataset was supposed to be only in Microsoft Excel format. I was completely wrong.
A dataset can consist of images or videos, as well as tabular files such as CSV and XLSX.
Please correct me if I am wrong: as I understand it, the output is derived by downloading, cleaning, and sometimes manipulating and mining datasets, and then applying appropriate algorithms (for example, machine learning algorithms) for predictive analytics.
So, here are some basic concepts about datasets, followed by a look at some of the most common public datasets you can download.
What is a dataset?
A dataset, or data set, is simply a collection of data.
The simplest and most common format for datasets you’ll find online is a spreadsheet or CSV format — a single file organized as a table of rows and columns. But some datasets will be stored in other formats, and they don’t have to be just one file. Sometimes a dataset may be a zip file or folder containing multiple data tables with related data.
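To make the "zip file with multiple related tables" case concrete, here is a minimal sketch using only Python's standard library. The file names (customers.csv, orders.csv) and their contents are made up for illustration, and the archive is built in memory so the example runs without downloading anything; a real dataset would be opened from a file path instead.

```python
import csv
import io
import zipfile


def build_sample_zip():
    """Create a small in-memory zip with two related CSV tables (hypothetical data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("customers.csv", "id,name\n1,Asha\n2,Ben\n")
        archive.writestr("orders.csv", "order_id,customer_id\n10,1\n11,1\n12,2\n")
    buffer.seek(0)
    return buffer


def read_tables(zip_file):
    """Return a dict mapping each CSV file name in the archive to its list of rows."""
    tables = {}
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.endswith(".csv"):
                with archive.open(name) as raw:
                    # Archive members are opened as bytes; wrap them to read text.
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    tables[name] = list(csv.DictReader(text))
    return tables


tables = read_tables(build_sample_zip())
for name, rows in tables.items():
    print(name, len(rows), "rows")
```

Here the two tables are related through the customer_id column in orders.csv, which refers to the id column in customers.csv.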
A tabular data set consists of roughly two components: rows and columns. A key feature of such a data set is that it is organized so that each row contains one observation, while each column holds one attribute of those observations.
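The rows-and-columns idea can be sketched in a few lines of Python. This is only an illustration: the sample data below is invented, and the standard csv module is just one of several ways to read a table.

```python
import csv
import io

# Hypothetical sample data, made up for illustration only.
SAMPLE_CSV = """name,age,city
Asha,29,Chennai
Ben,35,Singapore
Chen,41,Toronto
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row (observation)."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)
columns = list(rows[0].keys())

print(len(rows), "rows (observations)")
print(columns, "columns")
print(rows[0])  # a single observation: one value for each column
```

Each dictionary in rows is one observation, and the header line of the file supplies the column names.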
How are datasets created?
Different datasets are created in different ways. In this post, you’ll find links to sources with all kinds of datasets. Some of them will be machine-generated data. Some will be data that’s been collected via surveys. Some may be data that’s recorded from human observations. Some may be data that’s been scraped from websites or pulled via APIs.
Whenever you’re working with a dataset, it’s important to consider: how was this dataset created? Where does the data come from? Don’t jump right into the analysis; take the time to first understand the data you are working with.
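One possible way to take a first look at a dataset before analyzing it is to count its rows, list its columns, and check how many values are missing in each column. The sketch below assumes a small CSV table; the weather-style sample data and the summarize helper are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical sample data with one missing temperature value.
SAMPLE_CSV = """station,date,temperature_c
A,2021-01-01,4.5
A,2021-01-02,
B,2021-01-01,7.1
B,2021-01-02,6.8
"""


def summarize(text):
    """Return the row count, column names, and missing-value count per column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    missing = {
        # Treat empty or whitespace-only cells as missing.
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }
    return {"row_count": len(rows), "columns": columns, "missing": missing}


summary = summarize(SAMPLE_CSV)
print(summary)
```

A summary like this does not answer where the data came from, but it can surface early warning signs such as unexpectedly empty columns.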
Creating a Dataset
It’s easy to create a dataset on Kaggle and doing so is a great way to start a data science portfolio, share reproducible research, or work with collaborators on a project for work or school. You have the option to create private datasets to work solo or with invited collaborators or publish a dataset publicly to Kaggle for anyone to view, download, and analyze.
Types of Datasets
Kaggle supports a variety of dataset publication formats, but it strongly encourages dataset publishers to share their data in an accessible, non-proprietary format if possible. Not only are open, accessible data formats better supported on the platform, they are also easier to work with for more people regardless of their tools.
Kaggle's dataset documentation (linked under SOURCE) describes the file formats it recommends using when sharing data on Kaggle Datasets, and explains why and how to make less well-supported file types as accessible as possible to the data science community.
7 public data sets you can analyze for free right now.
1. Google Trends.
2. National Climatic Data Center.
3. Global Health Observatory data.
4. Data.gov.sg.
5. Earthdata.
6. Amazon Web Services Open Data Registry.
7. Pew Internet.
SOURCE:
https://www.dataquest.io/blog/free-datasets-for-projects/
https://towardsdatascience.com/what-is-a-data-set-9c6e38d33198
https://www.kaggle.com/docs/datasets
https://www.tableau.com/learn/articles/free-public-data-sets
With respect.