I am trying to find out what the state of the art is for databases, Python, and big data.
My starting point is SQL Server, pandas with multiprocessing, and dask. Imagine I need to maintain a database with more than 1 billion rows, keep inserting into it, and perform multiprocessing, larger-than-memory, complex analysis on it.
One drawback is that SQL Server is very slow at both inserting and extracting data: inserting 100k rows takes about 1 second, and reading the first 1M rows takes 5s+. That is very unsatisfactory compared with dask on parquet. However, with dask and parquet I cannot keep inserting into this more-than-1-billion-row dataset. Multi-indexes/non-clustered indexes are also not supported, which makes some previously fast SQL joins slower.
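On the insert side, row-at-a-time statements are usually the bottleneck; batching rows into a single `executemany` call (or, with SQL Server over pyodbc, enabling `fast_executemany`) typically helps a great deal. A minimal sketch using the stdlib `sqlite3` module as a stand-in for the server, with a hypothetical `trades` table:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER, price REAL)")

# 100k synthetic rows, comparable to the insert workload described above.
rows = [(i, i * 0.5) for i in range(100_000)]

start = time.perf_counter()
# One batched statement instead of 100k individual round-trips.
conn.executemany("INSERT INTO trades VALUES (?, ?)", rows)
conn.commit()
elapsed = time.perf_counter() - start

count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
print(count, f"{elapsed:.3f}s")
```

The same batching idea applies to SQL Server, where `BULK INSERT` or bulk-copy tooling is usually faster still.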
I looked around and found Apache Spark and PySpark, but I'm a bit unsure whether that is the correct step forward. Any suggestions? Thanks!